library(tidyverse)
library(broom)
library(modelsummary)
library(patchwork)
raw <- read_csv("data/morg-2014-emp.csv", show_col_types = FALSE)

Session 5: Causation vs. Correlation — Live Demo
Advanced Data Science · Europa-Universität Flensburg
1 Setup
1.1 Business question
Is there a gender wage gap among US workers? And what would it take to interpret a regression coefficient as evidence of discrimination?
1.2 Packages and data
The raw file contains 149316 observations with 23 variables from the CPS MORG. We keep only what we need and construct the relevant variables:
cps <- raw |>
  filter(
    !is.na(earnwke), earnwke > 0,
    !is.na(uhours), uhours > 0,
    !is.na(occ2012), !is.na(grade92),
    !is.na(age), !is.na(sex)
  ) |>
  mutate(
    wage = earnwke / uhours,
    female = as.integer(sex == 2),
    educ = grade92,
    occ = factor(occ2012)
  ) |>
  filter(wage > 0, wage < 500) |>
  select(wage, female, age, educ, occ)
glimpse(cps)

Rows: 149,306
Columns: 5
$ wage <dbl> 42.30000, 11.25000, 18.16667, 19.23075, 20.67300, 22.00000, 9.0…
$ female <int> 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, …
$ age <dbl> 29, 27, 30, 48, 46, 29, 25, 36, 56, 49, 50, 39, 42, 25, 39, 57,…
$ educ <dbl> 43, 41, 41, 40, 43, 39, 39, 44, 39, 46, 43, 39, 40, 35, 41, 41,…
$ occ <fct> 630, 5400, 8140, 8255, 5940, 6200, 5100, 2200, 5530, 1530, 3050…
We have 149306 workers after cleaning. The median hourly wage is $19.62 for men and $16.15 for women — a raw gap of 17.7%.
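The 17.7% figure takes the male median as the base. As a worked check using only the two medians quoted above:

\[\frac{19.62 - 16.15}{19.62} = \frac{3.47}{19.62} \approx 0.177\]

Using the female median as the base instead gives \(3.47 / 16.15 \approx 0.215\), so the choice of reference group should always be stated alongside the gap.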
2 The raw picture
2.1 Summary statistics by gender
cps |>
  mutate(gender = if_else(female == 1, "Female", "Male")) |>
  group_by(gender) |>
  summarise(
    N = n(),
    `Median wage` = median(wage) |> round(2),
    `Mean wage` = mean(wage) |> round(2),
    `SD wage` = sd(wage) |> round(2),
    `Median age` = median(age),
    `Median educ` = median(educ)
  )

# A tibble: 2 × 7
gender N `Median wage` `Mean wage` `SD wage` `Median age` `Median educ`
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Female 73734 16.2 20.2 14.4 41 41
2 Male 75572 19.6 23.9 15.2 40 40
2.2 Wage distribution by gender
cps |>
  mutate(gender = if_else(female == 1, "Female", "Male")) |>
  ggplot(aes(log(wage), fill = gender)) +
  geom_density(alpha = 0.45, colour = NA) +
  scale_fill_manual(values = c("Female" = "tomato", "Male" = "steelblue")) +
  labs(
    x = "log(hourly wage, USD)",
    y = "Density",
    fill = NULL,
    title = "Wage distribution by gender",
    # nrow(cps) replaces the undefined n_obs so the chunk runs on its own
    subtitle = paste0("CPS 2014 · n = ", format(nrow(cps), big.mark = ","))
  ) +
  theme_minimal()

There is a clear leftward shift for women, but the distributions overlap substantially. Before concluding this is discrimination, we need to ask: what else differs between men and women in this sample?
3 Three models, three questions
The core claim of this session: every regression answers a specific question. The question changes when you add or remove controls. This is not a matter of some models being “more controlled” — it is a matter of which causal pathway you are estimating.
3.1 Estimating the models
m1 <- lm(log(wage) ~ female, data = cps)
m2 <- lm(log(wage) ~ female + age + educ, data = cps)
m3 <- lm(log(wage) ~ female + age + educ + occ, data = cps)

3.2 The side-by-side comparison
modelsummary(
  list(
    "Model 1: Raw gap" = m1,
    "Model 2: + Demographics" = m2,
    "Model 3: + Occupation" = m3
  ),
  coef_map = c(
    "female" = "Female",
    "age" = "Age",
    "educ" = "Education (grade92)"
  ),
  gof_omit = "AIC|BIC|Log|F|RMSE",
  stars = TRUE,
  notes = "Occupation dummies (Model 3) omitted from display."
)

| | Model 1: Raw gap | Model 2: + Demographics | Model 3: + Occupation |
|---|---|---|---|
| Female | -0.164*** | -0.212*** | -0.138*** |
| | (0.003) | (0.003) | (0.003) |
| Age | | 0.012*** | 0.009*** |
| | | (0.000) | (0.000) |
| Education (grade92) | | 0.109*** | 0.060*** |
| | | (0.001) | (0.001) |
| Num.Obs. | 149306 | 149306 | 149306 |
| R2 | 0.016 | 0.271 | 0.384 |
| R2 Adj. | 0.016 | 0.271 | 0.382 |

+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
Occupation dummies (Model 3) omitted from display.
The female coefficient moves from -0.164 (Model 1) to -0.212 (Model 2) to -0.138 (Model 3). This does not mean Model 3 is the most correct. Each model answers a different question.
4 DAG-based interpretation
4.1 What each model claims
Model 1 — the raw gap
\[\log(\text{wage}_i) = \alpha + \beta_1 \cdot \text{female}_i + \varepsilon_i\]
This is a purely descriptive statement: female workers earn, on average, 0.164 log points less per hour, which is roughly 15% less (exp(-0.164) - 1 ≈ -0.151). It makes no causal claim. The coefficient absorbs everything that co-varies with gender: age, education, occupation, unobserved family background, and actual pay discrimination.
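A coefficient in a log-wage regression is measured in log points, which is only approximately a percentage difference. The exact conversion is \(100 \cdot (e^{\beta} - 1)\). The short sketch below applies it to the three female coefficients reported in the table; the vector names are illustrative only.

```r
# Coefficients on `female` as reported in the regression table.
b <- c(m1 = -0.164, m2 = -0.212, m3 = -0.138)

# Exact percentage difference implied by each log-point coefficient.
pct <- 100 * (exp(b) - 1)
round(pct, 1)
#>    m1    m2    m3
#> -15.1 -19.1 -12.9
```

The approximation "coefficient times 100" is close for small values and drifts as the coefficient grows in magnitude, which is why Model 2 shows the largest discrepancy here.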
Model 2 — controlling for demographics
\[\log(\text{wage}_i) = \alpha + \beta_1 \cdot \text{female}_i + \beta_2 \cdot \text{age}_i + \beta_3 \cdot \text{educ}_i + \varepsilon_i\]
Age and education are plausible confounders: they affect wages and are systematically related to gender (the sample shows some difference in educational composition). Adding them closes those back-door paths — but only if age and education are truly confounders and not mediators.
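Why can the gap widen when controls are added, as it does between Model 1 and Model 2? One way to see it is the omitted-variable-bias identity: the short-regression coefficient equals the long-regression coefficient plus the effect of the omitted variable times how that variable differs by gender. The sketch below uses simulated data; every number in the data-generating process is invented for illustration and is not an estimate from the CPS file.

```r
set.seed(1)
n <- 50000

# Hypothetical data-generating process (all values are assumptions):
female <- rbinom(n, 1, 0.5)
educ   <- 40 + 0.5 * female + rnorm(n)   # women slightly more educated
lwage  <- 2 - 0.20 * female + 0.10 * educ + rnorm(n, sd = 0.5)

short <- lm(lwage ~ female)          # analogue of Model 1
long  <- lm(lwage ~ female + educ)   # analogue of Model 2 (without age)
aux   <- lm(educ ~ female)           # how education differs by gender

b_short <- coef(short)[["female"]]
b_long  <- coef(long)[["female"]]
bias    <- coef(long)[["educ"]] * coef(aux)[["female"]]

# The identity holds exactly in-sample for OLS:
all.equal(b_short, b_long + bias)
round(c(short = b_short, long = b_long, bias = bias), 3)
```

In this setup the bias term is positive (education raises wages and women have more of it), so leaving education out makes the raw gap look smaller than the gap among equally educated workers.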
Are age and education confounders or mediators here? They are treated as confounders — they affect wages independently of gender, and they are not caused by gender. This is defensible if we think of “gender” as a fixed attribute the worker enters the labour market with.
Model 3 — adding occupation
\[\log(\text{wage}_i) = \alpha + \beta_1 \cdot \text{female}_i + \beta_2 \cdot \text{age}_i + \beta_3 \cdot \text{educ}_i + \sum_k \gamma_k \cdot \text{occ}_{ik} + \varepsilon_i\]
This is the key model — and the key trap.
If gender discrimination causes occupational sorting (women are pushed toward lower-paying occupations by social and structural forces), then occupation is a mediator: it lies on the causal path from gender to wage. Controlling for it blocks part of the very effect we are trying to measure.
The coefficient -0.138 is not a more precise estimate of discrimination. It is an estimate of the pay gap within occupation groups — the direct channel of discrimination, net of occupational sorting. Whether that narrower quantity is what we want depends entirely on the research question.
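The mediator argument can be checked on simulated data where the true effects are known. The sketch below assumes a deliberately simple world: one binary "high-paying occupation" indicator, a direct within-occupation effect of -0.10, and sorting that lowers women's probability of being in the high-paying group. All of these values are hypothetical and chosen only to make the mechanism visible.

```r
set.seed(2)
n <- 100000

female <- rbinom(n, 1, 0.5)
# Sorting (assumed): women are less likely to be in the high-paying group.
high_occ <- rbinom(n, 1, ifelse(female == 1, 0.3, 0.5))
# Direct effect of -0.10 and an occupation premium of 0.40 (both assumed).
lwage <- 2.5 - 0.10 * female + 0.40 * high_occ + rnorm(n, sd = 0.5)

# Without the mediator: total effect (direct + sorting channel).
total <- coef(lm(lwage ~ female))[["female"]]
# With the mediator: direct, within-occupation effect only.
direct <- coef(lm(lwage ~ female + high_occ))[["female"]]

# True values under this simulation:
# total = -0.10 + 0.40 * (0.3 - 0.5) = -0.18, direct = -0.10
round(c(total = total, direct = direct), 2)
```

Both regressions are "right" here: each recovers a different true quantity. Which one to report depends on whether the sorting channel is part of the question being asked.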
4.2 The bad control in plain terms
The bad control lesson: Adding occupation does not make the estimate better. It answers a different question: “Do women earn less than men in the same occupation, at the same age and education level?” That is sometimes the right question. But if we want to know the total effect of gender on wages — including the sorting mechanism — Model 3 gives us the wrong answer, not a better one.
tibble(
  model = factor(
    c("Model 1\nRaw gap", "Model 2\n+ Demographics", "Model 3\n+ Occupation"),
    levels = c("Model 1\nRaw gap", "Model 2\n+ Demographics", "Model 3\n+ Occupation")
  ),
  # Pull the female coefficient straight from each fitted model
  estimate = c(coef(m1)[["female"]], coef(m2)[["female"]], coef(m3)[["female"]]),
  se = c(
    summary(m1)$coefficients["female", "Std. Error"],
    summary(m2)$coefficients["female", "Std. Error"],
    summary(m3)$coefficients["female", "Std. Error"]
  )
) |>
  ggplot(aes(model, estimate)) +
  geom_hline(yintercept = 0, linetype = "dashed", colour = "gray50") +
  geom_pointrange(
    aes(ymin = estimate - 1.96 * se, ymax = estimate + 1.96 * se),
    colour = "steelblue", linewidth = 0.8, size = 0.7
  ) +
  labs(
    x = NULL,
    y = "Coefficient on Female",
    title = "Gender wage gap across specifications",
    subtitle = "95% confidence intervals shown"
  ) +
  theme_minimal()

5 The identification gap
5.1 Why none of these models proves discrimination
Each model estimates a conditional association. Even Model 2, our best effort at controlling for observable confounders, cannot rule out unobserved differences between men and women that affect wages and are correlated with gender: differences in negotiation behaviour, network access, industry composition, or employer-level pay-setting practices, for example.
An identification argument would need to claim: “The variation in gender I am exploiting is as good as random with respect to wages, conditional on my controls.” No observational regression on this data can make that claim credibly.
This is not a failure of the data or the technique. It is a property of the question. Gender is not randomly assigned, and the labour market is not an experiment.
5.2 What would help?
Two strategies introduced in upcoming sessions address similar problems:
Panel fixed effects (Session 6): if we had panel data on the same workers over time, we could control for all stable unobserved worker-level characteristics. This would eliminate time-invariant confounders but not time-varying ones.
Randomised experiments (Session 7): résumé audit studies — randomly assigning gender-signalling names to identical applications — provide clean causal identification of name-based discrimination. This is a real identification strategy for one narrow channel of the gap.
Neither strategy eliminates all confounding for the full gender wage gap. The point is that credibility comes from the research design, not from adding more controls.
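To make the "design, not controls" point concrete, the sketch below compares an observational setting with a randomised one on simulated data. The variable names, the unobserved trait `u`, and the true effect of -0.10 are all assumptions for illustration; nothing here is estimated from the CPS data or from a real audit study.

```r
set.seed(3)
n <- 100000

u      <- rnorm(n)   # unobserved trait that also affects the outcome
effect <- -0.10      # hypothetical true effect of the signal

# Observational world: the signal is correlated with the unobserved trait.
d_obs <- rbinom(n, 1, plogis(-u))
y_obs <- effect * d_obs + 0.5 * u + rnorm(n)

# Experimental world: the signal is assigned by coin flip.
d_rct <- rbinom(n, 1, 0.5)
y_rct <- effect * d_rct + 0.5 * u + rnorm(n)

est_obs <- coef(lm(y_obs ~ d_obs))[["d_obs"]]
est_rct <- coef(lm(y_rct ~ d_rct))[["d_rct"]]

round(c(observational = est_obs, randomised = est_rct, truth = effect), 2)
```

The same one-variable regression is badly biased in the first setting and close to the truth in the second. No control variable was added; only the assignment mechanism changed.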
6 Summary
| Model | Controls | Question answered | Causal? |
|---|---|---|---|
| 1 | None | Average raw difference in log wages | No — pure association |
| 2 | Age, education | Gap net of demographic composition | Partially — if no unobserved confounders |
| 3 | + Occupation | Gap within occupation, conditional on demographics | Different question — occupation is a mediator |
The key insight: the coefficient changes across models (from -0.164 to -0.212 to -0.138) not because we are getting closer to the truth, but because we are changing the question. Understanding which question each model answers is the skill this session develops.