Session 4: What Can Go Wrong — Biases and Diagnostics
Date: Thursday, 21 May 2026, 16:00–19:00
Business question
Does the hotel price–distance relationship hold across all European cities, or does it break down in some? And when can we actually trust a regression output?
Learning goals
- Understand what OLS needs to work well: the Gauss-Markov assumptions, and the unbiasedness and efficiency they deliver
- Recognise when the linearity assumption fails and how to address it
- Detect and correct for heteroskedasticity using robust standard errors
- Identify multicollinearity and understand its consequences
- Spot influential observations and assess their impact
- Understand omitted variable bias and why diagnostics cannot detect it
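The last learning goal can be illustrated with a small simulation. This is a minimal sketch on made-up data (the variables `x`, `z`, `y` and all coefficients are assumptions, not course data): omitting a confounder biases the coefficient of interest, yet the residuals of the short regression stay uncorrelated with the regressor by construction, so residual-based checks give no warning.

```r
# Simulated illustration of omitted variable bias.
# All names and numbers here are invented for the example.
set.seed(42)
n <- 5000
z <- rnorm(n)                        # confounder that the short model omits
x <- 0.8 * z + rnorm(n)              # regressor correlated with z
y <- 1 + 2 * x + 3 * z + rnorm(n)    # true coefficient on x is 2

full  <- lm(y ~ x + z)               # includes the confounder
short <- lm(y ~ x)                   # omits the confounder

coef(full)[["x"]]                    # close to the true value 2
coef(short)[["x"]]                   # biased: roughly 2 + 3 * cov(x, z) / var(x)

# OLS residuals are orthogonal to the included regressors by construction,
# so the short model's residuals reveal nothing about the missing z.
cor(resid(short), x)
```

The bias shows up only because the simulation knows the true model; with real data, subject knowledge has to supply the list of candidate omitted variables.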
Datasets
Slides and live demo — hotels-europe
Cross-section of 9,648 hotels from 8 European cities (November 2017, complete cases), ranging from Istanbul (median price ~€52) to London (~€171). The wide price variation across city types makes OLS assumption violations clearly visible.
| Variable | Description |
|---|---|
| price | Hotel price per night (EUR) |
| distance | Distance from city centre (km) |
| stars | Star rating (1–5) |
| rating | Guest rating |
| city | City name |
Source: Békés & Kézdi (2021). Download full dataset.
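One way to approach the business question is to estimate the price–distance slope separately for each city and then test whether the slopes differ. The sketch below uses simulated data with the column names from the table above; the city labels, the slopes, and the choice of `log(price)` as the outcome are assumptions made for illustration, not results from hotels-europe.

```r
# Simulated stand-in for the hotels data (placeholder cities, invented slopes).
set.seed(1)
cities <- c("CityA", "CityB", "CityC")
true_slope <- c(CityA = -0.20, CityB = -0.05, CityC = 0.00)

hotels <- do.call(rbind, lapply(cities, function(ct) {
  distance <- runif(300, 0, 10)
  data.frame(
    city     = ct,
    distance = distance,
    price    = exp(4.5 + true_slope[[ct]] * distance + rnorm(300, sd = 0.3))
  )
}))

# Slope of log(price) on distance, estimated city by city.
slopes <- sapply(split(hotels, hotels$city), function(d) {
  coef(lm(log(price) ~ distance, data = d))[["distance"]]
})
round(slopes, 3)

# Pooled alternative: do the interaction terms improve on a common slope?
common  <- lm(log(price) ~ distance + city, data = hotels)
by_city <- lm(log(price) ~ distance * city, data = hotels)
anova(common, by_city)
```

A large F statistic in the `anova()` comparison suggests that a single pooled slope hides differences between cities.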
Exercise — CPS earnings 2014
120,434 full-time employed US workers from the 2014 Current Population Survey (CPS), Monthly Outgoing Rotation Group. Used for the in-session exercise: diagnosing a Mincer wage regression.
| Variable | Description |
|---|---|
| wage | Hourly wage (USD) = weekly earnings / usual hours |
| earnwke | Weekly earnings (USD, top-coded at $2,884) |
| uhours | Usual hours worked per week |
| age | Age in years |
| educ | Educational attainment (CPS grade92 scale: 31–46) |
| female | 1 = female, 0 = male |
Source: Békés & Kézdi (2021). Full dataset documentation. Dataset included in the exercise repository — no separate download needed.
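As a starting point for the exercise, a Mincer-type regression might be set up as in the sketch below. It runs on simulated data that only borrows the column names from the table above; the coefficients, the sample size, and the decision to treat `educ` as a numeric score are assumptions, and top-coding of `earnwke` is ignored.

```r
# Simulated stand-in for the CPS extract (invented coefficients, no top-coding).
set.seed(2014)
n <- 2000
cps <- data.frame(
  age    = sample(18:64, n, replace = TRUE),
  educ   = sample(31:46, n, replace = TRUE),
  female = rbinom(n, 1, 0.5),
  uhours = sample(35:60, n, replace = TRUE)
)
log_wage <- 0.5 + 0.06 * (cps$educ - 31) + 0.08 * cps$age -
  0.0009 * cps$age^2 - 0.15 * cps$female + rnorm(n, sd = 0.4)
cps$earnwke <- exp(log_wage) * cps$uhours

# Hourly wage as defined in the variable table.
cps$wage <- cps$earnwke / cps$uhours

# Linear versus quadratic age profile: a simple functional-form check.
linear_age <- lm(log(wage) ~ educ + age + female, data = cps)
mincer     <- lm(log(wage) ~ educ + age + I(age^2) + female, data = cps)

anova(linear_age, mincer)   # does the squared term add explanatory power?
round(coef(mincer), 4)      # a negative I(age^2) gives a concave age profile
```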
Session outline
- Recap of previous sessions
- Input: OLS properties (BLUE), functional form, heteroskedasticity
- Live demo: comparing specifications and applying robust SEs
- Break
- Input: multicollinearity, influential observations, normality, OVB
- In-session exercise
- Debrief + Quarto skill: multi-panel figures and @fig- cross-references
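Several of the diagnostics named in the outline can be computed in base R. The following is a minimal sketch on simulated data, not the live-demo code: it builds HC1 robust standard errors by hand from the sandwich formula, computes a variance inflation factor for a near-duplicate regressor, and flags influential points with Cook's distance. The variable names and the 4/n cutoff (a common rule of thumb) are assumptions.

```r
# Simulated data with error variance that grows with x (invented numbers).
set.seed(7)
n <- 1000
x <- runif(n, 0, 10)
y <- 1 + 0.5 * x + rnorm(n, sd = 0.1 + 0.05 * x^2)
fit <- lm(y ~ x)

# Heteroskedasticity-robust (HC1) covariance: bread %*% meat %*% bread.
X <- model.matrix(fit)
e <- resid(fit)
k <- ncol(X)
bread <- solve(crossprod(X))    # (X'X)^{-1}
meat  <- crossprod(X * e)       # sum over i of e_i^2 * x_i x_i'
vcov_hc1 <- n / (n - k) * bread %*% meat %*% bread

se_classical <- sqrt(diag(vcov(fit)))
se_robust    <- sqrt(diag(vcov_hc1))
cbind(se_classical, se_robust)

# Multicollinearity: VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from
# regressing one regressor on the others. x2 is nearly a copy of x.
x2 <- x + rnorm(n, sd = 0.5)
vif_x <- 1 / (1 - summary(lm(x ~ x2))$r.squared)
vif_x

# Influential observations: Cook's distance with a rule-of-thumb cutoff.
cooks <- cooks.distance(fit)
flagged <- which(cooks > 4 / n)
head(sort(cooks, decreasing = TRUE), 3)
```

In practice a package such as sandwich provides these covariance estimators directly; the manual version is shown only to make the formula visible.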
Materials
| File | Description |
|---|---|
| Slides | Lecture slides — open in browser, press F for fullscreen |
| Live demo | Coding document built during the session — complete diagnostic workflow |
| Exercise | In-session exercise: diagnosing and correcting a regression (via GitHub Classroom) |
| Exercise solution | Example solution (added after session) |
Quarto skill introduced this session
Embedding diagnostic plots in a Quarto report: figure captions, @fig- cross-references, and multi-panel layouts using layout-ncol.
```{r}
#| label: fig-diagnostics
#| fig-cap: "Two key diagnostic plots."
#| fig-subcap:
#| - "Scale-location plot"
#| - "Q-Q plot"
#| layout-ncol: 2
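# plot 1: scale-location plot (assumes a fitted lm object named `fit`)
plot(fit, which = 3)
# plot 2: normal Q-Q plot of the residuals
plot(fit, which = 2)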
```
Cross-reference in text: As shown in @fig-diagnostics-1, …