library(tidyverse)
library(modelsummary)
library(flextable)Lecture Notes: Categorical Variables and Interaction Terms
Session 2 — Extending Take-Home Task 1
1 Introduction
These notes pick up exactly where Take-Home Task 1 left off.
In the task you estimated three models of weekly earnings for market research analysts using age, hours worked, and a binary college indicator. The final reflection question asked which important variable had been left out. The answer is gender: female analysts earn systematically less than male analysts, and because gender is correlated with both age and earnings, leaving it out biased the age coefficients.
This document walks through:
- Adding gender back from the raw CPS data
- Extending the education variable from a binary to a three-category measure
- Fitting an interaction term to ask whether the age–earnings profile looks the same for men and women
- Putting together a clean, publication-ready comparison table
Data: The raw CPS file morg-2014-emp.csv is the same file used in Task 1. If you completed the task via GitHub Classroom, it is in your task repository. Copy it into the same folder as this document before running the code below.
2 Rebuild the Task 1 Dataset
We start from scratch so these notes are self-contained. The data preparation steps are identical to Task 1, with one addition: we keep sex and grade92 before the final select().
cps_raw <- read_csv("data/morg-2014-emp.csv")
cps <- cps_raw |>
# Same filters as Task 1
filter(occ2012 == 735) |> # Market research analysts only
filter(age >= 24, age <= 64) |> # Working-age adults
filter(uhours >= 20) |> # At least part-time
mutate(
# Binary college indicator from Task 1 (keep for comparison)
college = ifelse(grade92 >= 43, 1, 0),
# Gender: CPS codes 1 = Male, 2 = Female
female = ifelse(sex == 2, 1, 0)
) |>
select(earnwke, age, uhours, college, female, grade92) |>
drop_na()Notice that grade92 is still in the dataset — we will use it in Section 3 to build a richer education variable.
3 Adding Gender to the Model
3.1 A quick look at the raw earnings gap
Before running any regression, let us see the raw difference. A simple group summary gives us the starting point.
cps |>
group_by(female) |>
summarise(
n = n(),
mean_earnwke = mean(earnwke),
median_earn = median(earnwke)
) |>
flextable() |>
theme_vanilla()female | n | mean_earnwke | median_earn |
|---|---|---|---|
0 | 101 | 1,417.693 | 1,307.00 |
1 | 151 | 1,199.320 | 1,115.38 |
On average, female analysts earn noticeably less. But this comparison does not hold age or hours constant. The regression will do that.
3.2 Model 3 from the take-home, now including gender
We add female to the full model from Task 1.
# Recreate age-squared (as in Task 1)
cps <- cps |> mutate(age_sq = age^2)
# Models from Task 1
m1 <- lm(earnwke ~ age, data = cps)
m2 <- lm(earnwke ~ age + age_sq, data = cps)
m3 <- lm(earnwke ~ age + age_sq + uhours + college, data = cps)
# New model: add gender
m4 <- lm(earnwke ~ age + age_sq + uhours + college + female, data = cps)
modelsummary(m4, gof_omit = "AIC | BIC | Log.Lik | F | RMSE")| (1) | |
|---|---|
| (Intercept) | -2418.560 |
| (534.353) | |
| age | 86.114 |
| (25.741) | |
| age_sq | -0.843 |
| (0.305) | |
| uhours | 36.316 |
| (4.762) | |
| college | 315.911 |
| (97.195) | |
| female | -83.293 |
| (74.141) | |
| Num.Obs. | 252 |
| R2 | 0.330 |
| R2 Adj. | 0.317 |
| AIC | 3912.9 |
| BIC | 3937.6 |
| Log.Lik. | -1949.431 |
| F | 24.253 |
| RMSE | 553.85 |
3.3 Interpreting the gender coefficient
The coefficient on female is the estimated earnings difference between female and male analysts of the same age, working the same hours, with the same education level.
Read it like this:
“Holding age, hours, and education constant, female market research analysts earn approximately and on average $ -83 per week less than their male counterparts.”
Notice also what happened to the age coefficients when you added gender. Did they shift? This is the omitted variable bias story in action.
Note: While isolating the gender effect is insightful and important, it might also shallow indirect effects of gender inequalities, e.g. those difference that occur because women tend to work less. To identify such more complex relationships we need to take into account interactions, a topic we turn to below.
4 A Richer Education Variable
The binary college variable collapses all of post-secondary education into a single step. But the CPS grade92 variable distinguishes between bachelor’s degrees and graduate degrees. Let us use that.
4.1 Recoding grade92 into three categories
cps <- cps |>
mutate(
edu_cat = case_when(
grade92 < 43 ~ "No degree", # Less than bachelor's
grade92 <= 44 ~ "Bachelor's", # Bachelor's degree
grade92 >= 45 ~ "Graduate" # Master's, professional, doctoral
),
# Set "No degree" as the reference category
edu_cat = factor(edu_cat, levels = c("No degree", "Bachelor's", "Graduate"))
)
# Check how many observations fall in each category
cps |>
count(edu_cat) |>
flextable() |>
theme_vanilla()edu_cat | n |
|---|---|
No degree | 43 |
Bachelor's | 202 |
Graduate | 7 |
4.2 Adding the three-category education variable
m5 <- lm(earnwke ~ age + age_sq + uhours + edu_cat + female, data = cps)
modelsummary(m5, gof_omit = "AIC | BIC | Log.Lik | F | RMSE")| (1) | |
|---|---|
| (Intercept) | -2370.647 |
| (531.433) | |
| age | 84.683 |
| (25.585) | |
| age_sq | -0.838 |
| (0.303) | |
| uhours | 36.289 |
| (4.731) | |
| edu_catBachelor's | 295.116 |
| (97.103) | |
| edu_catGraduate | 739.085 |
| (228.111) | |
| female | -74.917 |
| (73.778) | |
| Num.Obs. | 252 |
| R2 | 0.341 |
| R2 Adj. | 0.325 |
| AIC | 3910.6 |
| BIC | 3938.8 |
| Log.Lik. | -1947.293 |
| F | 21.172 |
| RMSE | 549.17 |
4.2.1 Reading the output
The edu_cat coefficients each represent the earnings difference relative to the “No degree” reference group, holding all other variables constant.
edu_catBachelor's: how much more (or less) a bachelor’s-educated analyst earns compared to one with no degreeedu_catGraduate: how much more (or less) a graduate-educated analyst earns compared to one with no degree
The difference between bachelor’s and graduate earnings is not directly in the table. You calculate it by subtracting one coefficient from the other:
coefs <- coef(m5)
grad_vs_ba <- coefs["edu_catGraduate"] - coefs["edu_catBachelor's"]
grad_vs_baedu_catGraduate
443.969
5 Interaction Terms: Does the Age Profile Differ by Gender?
5.1 Motivation
So far every model assumes the age–earnings profile is the same for men and women — only the level differs (captured by female). This is the “parallel slopes” assumption.
But we might expect the profile to be steeper for men: if women are more likely to take career breaks, their earnings may grow more slowly with age, and the gap may widen over a career.
To check, we first look at the data.
5.2 Visualising the question
cps |>
mutate(gender = ifelse(female == 1, "Female", "Male")) |>
ggplot(aes(x = age, y = earnwke, colour = gender)) +
geom_point(alpha = 0.3, size = 1) +
geom_smooth(method = "lm", se = FALSE) +
scale_colour_manual(values = c("Male" = "#00395B", "Female" = "#69AACD")) +
labs(
x = "Age",
y = "Weekly earnings (USD)",
colour = NULL,
title = "Are the age–earnings slopes the same for men and women?"
) +
theme_minimal()Look at the two regression lines. Are they roughly parallel? Or does one slope up more steeply than the other? The interaction model will quantify what you see here.
5.3 Fitting the interaction
The * notation in lm() is shorthand: age * female expands to age + female + age:female. The age:female term is the interaction.
m6 <- lm(earnwke ~ age * female + age_sq + uhours + edu_cat, data = cps)
modelsummary(m6, gof_omit = "AIC | BIC | Log.Lik | F | RMSE")| (1) | |
|---|---|
| (Intercept) | -2419.504 |
| (538.403) | |
| age | 84.718 |
| (25.619) | |
| female | 76.860 |
| (264.996) | |
| age_sq | -0.810 |
| (0.306) | |
| uhours | 36.349 |
| (4.738) | |
| edu_catBachelor's | 295.184 |
| (97.231) | |
| edu_catGraduate | 731.087 |
| (228.805) | |
| age × female | -3.871 |
| (6.490) | |
| Num.Obs. | 252 |
| R2 | 0.342 |
| R2 Adj. | 0.324 |
| AIC | 3912.2 |
| BIC | 3944.0 |
| Log.Lik. | -1947.109 |
| F | 18.151 |
| RMSE | 548.77 |
5.4 Interpreting the interaction coefficient
The model now contains three age-related terms:
| Term | What it measures |
|---|---|
age |
slope of age for men (female = 0) |
female |
earnings level gap at age = 0 (not meaningful by itself) |
age:female |
extra slope for women — how the age slope differs for women |
The effect of one more year of age:
- For men: \(\hat{\beta}_\text{age}\)
- For women: \(\hat{\beta}_\text{age} + \hat{\beta}_{\text{age:female}}\)
b <- coef(m6)
slope_men <- b["age"]
slope_women <- b["age"] + b["age:female"]
cat("Age slope for men: ", round(slope_men, 2), "\n")Age slope for men: 84.72
cat("Age slope for women:", round(slope_women, 2), "\n")Age slope for women: 80.85
cat("Difference: ", round(b["age:female"], 2), "\n")Difference: -3.87
Always report the group-specific slopes alongside (or instead of) the interaction coefficient itself — they are what your reader actually needs.
6 Putting It All Together
6.1 A clean comparison table
The table below brings together all six models. This is what a polished comparison table looks like: renamed rows, relevant fit statistics, a source note.
coef_labels <- c(
"age" = "Age",
"age_sq" = "Age²",
"uhours" = "Weekly hours",
"college" = "College degree",
"female" = "Female",
"edu_catBachelor's" = "Bachelor's degree",
"edu_catGraduate" = "Graduate degree",
"age:female" = "Age × Female",
"(Intercept)" = "Constant"
)
modelsummary(
list(
"M1: Age" = m1,
"M2: Quadratic" = m2,
"M3: + Controls" = m3,
"M4: + Gender" = m4,
"M5: + Edu cat." = m5,
"M6: Interaction"= m6
),
coef_map = coef_labels,
gof_map = c("nobs", "r.squared", "adj.r.squared"),
stars = TRUE,
notes = "CPS 2014. Market research analysts and marketing specialists (occ2012 = 735), age 24–64, usual weekly hours >= 20."
)| M1: Age | M2: Quadratic | M3: + Controls | M4: + Gender | M5: + Edu cat. | M6: Interaction | |
|---|---|---|---|---|---|---|
| + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001 | ||||||
| CPS 2014. Market research analysts and marketing specialists (occ2012 = 735), age 24–64, usual weekly hours >= 20. | ||||||
| Age | 14.365*** | 110.171*** | 85.030** | 86.114*** | 84.683** | 84.718** |
| (3.728) | (29.541) | (25.737) | (25.741) | (25.585) | (25.619) | |
| Age² | -1.142** | -0.830** | -0.843** | -0.838** | -0.810** | |
| (0.349) | (0.305) | (0.305) | (0.303) | (0.306) | ||
| Weekly hours | 37.572*** | 36.316*** | 36.289*** | 36.349*** | ||
| (4.631) | (4.762) | (4.731) | (4.738) | |||
| College degree | 313.155** | 315.911** | ||||
| (97.216) | (97.195) | |||||
| Female | -83.293 | -74.917 | 76.860 | |||
| (74.141) | (73.778) | (264.996) | ||||
| Bachelor's degree | 295.116** | 295.184** | ||||
| (97.103) | (97.231) | |||||
| Graduate degree | 739.085** | 731.087** | ||||
| (228.111) | (228.805) | |||||
| Age × Female | -3.871 | |||||
| (6.490) | ||||||
| Constant | 724.712*** | -1133.072+ | -2497.224*** | -2418.560*** | -2370.647*** | -2419.504*** |
| (151.674) | (587.592) | (530.026) | (534.353) | (531.433) | (538.403) | |
| Num.Obs. | 252 | 252 | 252 | 252 | 252 | 252 |
| R2 | 0.056 | 0.095 | 0.327 | 0.330 | 0.341 | 0.342 |
| R2 Adj. | 0.052 | 0.088 | 0.316 | 0.317 | 0.325 | 0.324 |
6.2 Things to notice in the table
R² vs. adjusted R²: R² never decreases when you add variables. Adjusted R² penalises complexity. Compare M3 and M5: does adding the finer education categories actually improve fit?
Coefficient stability: How much does the
agecoefficient change from M2 to M3, and again from M3 to M4? Large shifts signal that the omitted variable was important.The interaction in M6: Is
Age × Femalestatistically significant? Does it change the interpretation of theFemalecoefficient from M4?
6.3 A plain-language summary
Here is an example of how to summarise M6 for a non-technical reader:
Weekly earnings for market research analysts rise with age and experience, but the increase is somewhat slower for women than for men. A female analyst in this sample earns substantially less than a comparable male analyst even after accounting for age, hours worked, and education level. Analysts with graduate degrees earn meaningfully more than those without a bachelor’s degree, with bachelor’s holders falling in between. That said, the analysis is limited to a single occupation in a single year — the patterns may differ in other fields or time periods.
These notes accompany Session 2. This is an extended and commented version of the code that we have developed in the lecture. For questions, post in GitHub Discussions.