Lecture Notes: Categorical Variables and Interaction Terms

Session 2 — Extending Take-Home Task 1

Published

09 04 2026

library(tidyverse)
library(modelsummary)
library(flextable)

1 Introduction

These notes pick up exactly where Take-Home Task 1 left off.

In the task you estimated three models of weekly earnings for market research analysts using age, hours worked, and a binary college indicator. The final reflection question asked which important variable had been left out. The answer is gender: female analysts earn systematically less than male analysts, and because gender is correlated with both age and earnings, leaving it out biased the age coefficients.

This document walks through:

  1. Adding gender back from the raw CPS data
  2. Extending the education variable from a binary to a three-category measure
  3. Fitting an interaction term to ask whether the age–earnings profile looks the same for men and women
  4. Putting together a clean, publication-ready comparison table

Data: The raw CPS file morg-2014-emp.csv is the same file used in Task 1. If you completed the task via GitHub Classroom, it is in your task repository. Copy it into the same folder as this document before running the code below.

2 Rebuild the Task 1 Dataset

We start from scratch so these notes are self-contained. The data preparation steps are identical to Task 1, with one addition: we keep sex and grade92 before the final select().

cps_raw <- read_csv("data/morg-2014-emp.csv")

cps <- cps_raw |>
  # Same filters as Task 1
  filter(occ2012 == 735) |>        # Market research analysts only
  filter(age >= 24, age <= 64) |>  # Working-age adults
  filter(uhours >= 20) |>          # At least part-time
  mutate(
    # Binary college indicator from Task 1 (keep for comparison)
    college = ifelse(grade92 >= 43, 1, 0),
    # Gender: CPS codes 1 = Male, 2 = Female
    female  = ifelse(sex == 2, 1, 0)
  ) |>
  select(earnwke, age, uhours, college, female, grade92) |>
  drop_na()

Notice that grade92 is still in the dataset — we will use it in Section 3 to build a richer education variable.

3 Adding Gender to the Model

3.1 A quick look at the raw earnings gap

Before running any regression, let us see the raw difference. A simple group summary gives us the starting point.

cps |>
  group_by(female) |>
  summarise(
    n            = n(),
    mean_earnwke = mean(earnwke),
    median_earn  = median(earnwke)
  ) |> 
  flextable() |> 
  theme_vanilla()

female

n

mean_earnwke

median_earn

0

101

1,417.693

1,307.00

1

151

1,199.320

1,115.38

On average, female analysts earn noticeably less. But this comparison does not hold age or hours constant. The regression will do that.

3.2 Model 3 from the take-home, now including gender

We add female to the full model from Task 1.

# Recreate age-squared (as in Task 1)
cps <- cps |> mutate(age_sq = age^2)

# Models from Task 1
m1 <- lm(earnwke ~ age,                             data = cps)
m2 <- lm(earnwke ~ age + age_sq,                    data = cps)
m3 <- lm(earnwke ~ age + age_sq + uhours + college, data = cps)

# New model: add gender
m4 <- lm(earnwke ~ age + age_sq + uhours + college + female, data = cps)

modelsummary(m4, gof_omit = "AIC | BIC | Log.Lik | F | RMSE")
(1)
(Intercept) -2418.560
(534.353)
age 86.114
(25.741)
age_sq -0.843
(0.305)
uhours 36.316
(4.762)
college 315.911
(97.195)
female -83.293
(74.141)
Num.Obs. 252
R2 0.330
R2 Adj. 0.317
AIC 3912.9
BIC 3937.6
Log.Lik. -1949.431
F 24.253
RMSE 553.85

3.3 Interpreting the gender coefficient

The coefficient on female is the estimated earnings difference between female and male analysts of the same age, working the same hours, with the same education level.

Read it like this:

“Holding age, hours, and education constant, female market research analysts earn approximately and on average $ -83 per week less than their male counterparts.”

Notice also what happened to the age coefficients when you added gender. Did they shift? This is the omitted variable bias story in action.

Note: While isolating the gender effect is insightful and important, it might also shallow indirect effects of gender inequalities, e.g. those difference that occur because women tend to work less. To identify such more complex relationships we need to take into account interactions, a topic we turn to below.

4 A Richer Education Variable

The binary college variable collapses all of post-secondary education into a single step. But the CPS grade92 variable distinguishes between bachelor’s degrees and graduate degrees. Let us use that.

4.1 Recoding grade92 into three categories

cps <- cps |>
  mutate(
    edu_cat = case_when(
      grade92 < 43  ~ "No degree",       # Less than bachelor's
      grade92 <= 44 ~ "Bachelor's",      # Bachelor's degree
      grade92 >= 45 ~ "Graduate"         # Master's, professional, doctoral
    ),
    # Set "No degree" as the reference category
    edu_cat = factor(edu_cat, levels = c("No degree", "Bachelor's", "Graduate"))
  )

# Check how many observations fall in each category
cps |> 
  count(edu_cat) |> 
  flextable() |> 
  theme_vanilla()

edu_cat

n

No degree

43

Bachelor's

202

Graduate

7

4.2 Adding the three-category education variable

m5 <- lm(earnwke ~ age + age_sq + uhours + edu_cat + female, data = cps)

modelsummary(m5, gof_omit = "AIC | BIC | Log.Lik | F | RMSE")
(1)
(Intercept) -2370.647
(531.433)
age 84.683
(25.585)
age_sq -0.838
(0.303)
uhours 36.289
(4.731)
edu_catBachelor's 295.116
(97.103)
edu_catGraduate 739.085
(228.111)
female -74.917
(73.778)
Num.Obs. 252
R2 0.341
R2 Adj. 0.325
AIC 3910.6
BIC 3938.8
Log.Lik. -1947.293
F 21.172
RMSE 549.17

4.2.1 Reading the output

The edu_cat coefficients each represent the earnings difference relative to the “No degree” reference group, holding all other variables constant.

  • edu_catBachelor's: how much more (or less) a bachelor’s-educated analyst earns compared to one with no degree
  • edu_catGraduate: how much more (or less) a graduate-educated analyst earns compared to one with no degree

The difference between bachelor’s and graduate earnings is not directly in the table. You calculate it by subtracting one coefficient from the other:

coefs <- coef(m5)
grad_vs_ba <- coefs["edu_catGraduate"] - coefs["edu_catBachelor's"]
grad_vs_ba
edu_catGraduate 
        443.969 

5 Interaction Terms: Does the Age Profile Differ by Gender?

5.1 Motivation

So far every model assumes the age–earnings profile is the same for men and women — only the level differs (captured by female). This is the “parallel slopes” assumption.

But we might expect the profile to be steeper for men: if women are more likely to take career breaks, their earnings may grow more slowly with age, and the gap may widen over a career.

To check, we first look at the data.

5.2 Visualising the question

cps |>
  mutate(gender = ifelse(female == 1, "Female", "Male")) |>
  ggplot(aes(x = age, y = earnwke, colour = gender)) +
  geom_point(alpha = 0.3, size = 1) +
  geom_smooth(method = "lm", se = FALSE) +
  scale_colour_manual(values = c("Male" = "#00395B", "Female" = "#69AACD")) +
  labs(
    x      = "Age",
    y      = "Weekly earnings (USD)",
    colour = NULL,
    title  = "Are the age–earnings slopes the same for men and women?"
  ) +
  theme_minimal()
Figure 1: Age and weekly earnings by gender, market research analysts (CPS 2014)

Look at the two regression lines. Are they roughly parallel? Or does one slope up more steeply than the other? The interaction model will quantify what you see here.

5.3 Fitting the interaction

The * notation in lm() is shorthand: age * female expands to age + female + age:female. The age:female term is the interaction.

m6 <- lm(earnwke ~ age * female + age_sq + uhours + edu_cat, data = cps)

modelsummary(m6, gof_omit = "AIC | BIC | Log.Lik | F | RMSE")
(1)
(Intercept) -2419.504
(538.403)
age 84.718
(25.619)
female 76.860
(264.996)
age_sq -0.810
(0.306)
uhours 36.349
(4.738)
edu_catBachelor's 295.184
(97.231)
edu_catGraduate 731.087
(228.805)
age × female -3.871
(6.490)
Num.Obs. 252
R2 0.342
R2 Adj. 0.324
AIC 3912.2
BIC 3944.0
Log.Lik. -1947.109
F 18.151
RMSE 548.77

5.4 Interpreting the interaction coefficient

The model now contains three age-related terms:

Term What it measures
age slope of age for men (female = 0)
female earnings level gap at age = 0 (not meaningful by itself)
age:female extra slope for women — how the age slope differs for women

The effect of one more year of age:

  • For men: \(\hat{\beta}_\text{age}\)
  • For women: \(\hat{\beta}_\text{age} + \hat{\beta}_{\text{age:female}}\)
b <- coef(m6)

slope_men   <- b["age"]
slope_women <- b["age"] + b["age:female"]

cat("Age slope for men:  ", round(slope_men,   2), "\n")
Age slope for men:   84.72 
cat("Age slope for women:", round(slope_women, 2), "\n")
Age slope for women: 80.85 
cat("Difference:         ", round(b["age:female"], 2), "\n")
Difference:          -3.87 

Always report the group-specific slopes alongside (or instead of) the interaction coefficient itself — they are what your reader actually needs.

6 Putting It All Together

6.1 A clean comparison table

The table below brings together all six models. This is what a polished comparison table looks like: renamed rows, relevant fit statistics, a source note.

coef_labels <- c(
  "age"                = "Age",
  "age_sq"             = "Age²",
  "uhours"             = "Weekly hours",
  "college"            = "College degree",
  "female"             = "Female",
  "edu_catBachelor's"  = "Bachelor's degree",
  "edu_catGraduate"    = "Graduate degree",
  "age:female"         = "Age × Female",
  "(Intercept)"        = "Constant"
)

modelsummary(
  list(
    "M1: Age"        = m1,
    "M2: Quadratic"  = m2,
    "M3: + Controls" = m3,
    "M4: + Gender"   = m4,
    "M5: + Edu cat." = m5,
    "M6: Interaction"= m6
  ),
  coef_map = coef_labels,
  gof_map  = c("nobs", "r.squared", "adj.r.squared"),
  stars    = TRUE,
  notes    = "CPS 2014. Market research analysts and marketing specialists (occ2012 = 735), age 24–64, usual weekly hours >= 20."
)
Table 1: Weekly earnings of market research analysts: regression results (CPS 2014)
M1: Age M2: Quadratic M3: + Controls M4: + Gender M5: + Edu cat. M6: Interaction
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
CPS 2014. Market research analysts and marketing specialists (occ2012 = 735), age 24–64, usual weekly hours >= 20.
Age 14.365*** 110.171*** 85.030** 86.114*** 84.683** 84.718**
(3.728) (29.541) (25.737) (25.741) (25.585) (25.619)
Age² -1.142** -0.830** -0.843** -0.838** -0.810**
(0.349) (0.305) (0.305) (0.303) (0.306)
Weekly hours 37.572*** 36.316*** 36.289*** 36.349***
(4.631) (4.762) (4.731) (4.738)
College degree 313.155** 315.911**
(97.216) (97.195)
Female -83.293 -74.917 76.860
(74.141) (73.778) (264.996)
Bachelor's degree 295.116** 295.184**
(97.103) (97.231)
Graduate degree 739.085** 731.087**
(228.111) (228.805)
Age × Female -3.871
(6.490)
Constant 724.712*** -1133.072+ -2497.224*** -2418.560*** -2370.647*** -2419.504***
(151.674) (587.592) (530.026) (534.353) (531.433) (538.403)
Num.Obs. 252 252 252 252 252 252
R2 0.056 0.095 0.327 0.330 0.341 0.342
R2 Adj. 0.052 0.088 0.316 0.317 0.325 0.324

6.2 Things to notice in the table

  1. R² vs. adjusted R²: R² never decreases when you add variables. Adjusted R² penalises complexity. Compare M3 and M5: does adding the finer education categories actually improve fit?

  2. Coefficient stability: How much does the age coefficient change from M2 to M3, and again from M3 to M4? Large shifts signal that the omitted variable was important.

  3. The interaction in M6: Is Age × Female statistically significant? Does it change the interpretation of the Female coefficient from M4?

6.3 A plain-language summary

Here is an example of how to summarise M6 for a non-technical reader:

Weekly earnings for market research analysts rise with age and experience, but the increase is somewhat slower for women than for men. A female analyst in this sample earns substantially less than a comparable male analyst even after accounting for age, hours worked, and education level. Analysts with graduate degrees earn meaningfully more than those without a bachelor’s degree, with bachelor’s holders falling in between. That said, the analysis is limited to a single occupation in a single year — the patterns may differ in other fields or time periods.


These notes accompany Session 2. This is an extended and commented version of the code that we have developed in the lecture. For questions, post in GitHub Discussions.