Session 4: What Can Go Wrong — Biases and Diagnostics

Published

21 05 2026

Modified

12 05 2026

Note

Date: Thursday, 21 May 2026, 16:00–19:00

Business question

Does the hotel price–distance relationship hold across all European cities, or does it break down in some? And when can we actually trust a regression output?

Learning goals

  • Understand what OLS needs to work well: unbiasedness, efficiency, and the Gauss-Markov framework
  • Recognise when the linearity assumption fails and how to address it
  • Detect and correct for heteroskedasticity using robust standard errors
  • Identify multicollinearity and understand its consequences
  • Spot influential observations and assess their impact
  • Understand omitted variable bias and why diagnostics cannot detect it
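
Three of the diagnostics named above can be tried out in base R without extra packages. The sketch below is illustrative only: the data are simulated (an assumption, not the session data), the variable names `x1`, `x2` and `y` are hypothetical, and the HC1 variant and the 4/n cut-off are common conventions rather than the only options.

```r
# Simulated data (assumption): the error spread grows with x1, and x2 is
# nearly collinear with x1.
set.seed(42)
n  <- 500
x1 <- runif(n, 0, 10)
x2 <- x1 + rnorm(n, sd = 0.5)
y  <- 100 - 5 * x1 + 2 * x2 + rnorm(n, sd = 1 + 2 * x1)

fit <- lm(y ~ x1 + x2)

# Heteroskedasticity-robust (HC1) standard errors via the sandwich formula
X     <- model.matrix(fit)
e     <- resid(fit)
k     <- ncol(X)
bread <- solve(crossprod(X))   # (X'X)^{-1}
meat  <- crossprod(X * e)      # X' diag(e^2) X
vcov_hc1 <- (n / (n - k)) * bread %*% meat %*% bread

se_classical <- sqrt(diag(vcov(fit)))
se_robust    <- sqrt(diag(vcov_hc1))
print(cbind(classical = se_classical, robust = se_robust))

# Variance inflation factor for x1: 1 / (1 - R^2) from regressing x1 on x2
vif_x1 <- 1 / (1 - summary(lm(x1 ~ x2))$r.squared)
print(vif_x1)

# Influential observations: Cook's distance with the 4/n rule of thumb
cooks       <- cooks.distance(fit)
influential <- which(cooks > 4 / n)
print(length(influential))
```

In practice the same robust standard errors are usually obtained from a package such as sandwich; the manual version is shown here only to make the formula visible.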

Datasets

Slides and live demo — hotels-europe

Cross-section of 9,648 hotels from 8 European cities (November 2017, complete cases), ranging from Istanbul (median price ~€52) to London (~€171). The wide price variation across city types makes OLS assumption violations clearly visible.

| Variable | Description |
|----------|-------------|
| price | Hotel price per night (EUR) |
| distance | Distance from city centre (km) |
| stars | Star rating (1–5) |
| rating | Guest rating |
| city | City name |

Source: Békés & Kézdi (2021). Download full dataset.
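
One way to approach the business question is to estimate the price–distance slope separately for each city and then compare. The sketch below assumes a data frame with the column names from the variable table; the values are simulated and the city labels are hypothetical, so the numbers say nothing about the real hotels-europe data.

```r
# Simulated stand-in for hotels-europe (assumption): same column names as the
# variable table, but made-up values and hypothetical city labels.
set.seed(1)
hotels <- data.frame(
  city     = rep(c("City A", "City B"), each = 200),
  distance = runif(400, 0, 10)
)

# City A is built with a negative price-distance slope, City B with none
true_slope   <- ifelse(hotels$city == "City A", -8, 0)
hotels$price <- 120 + true_slope * hotels$distance + rnorm(400, sd = 15)

# Fit price ~ distance separately for each city and collect the slopes
slopes <- sapply(split(hotels, hotels$city), function(d) {
  coef(lm(price ~ distance, data = d))[["distance"]]
})
print(round(slopes, 2))

# Single pooled model: the interaction term estimates how the slope differs
fit_int <- lm(price ~ distance * city, data = hotels)
print(summary(fit_int)$coefficients)
```

The per-city fits are easier to read; the interaction model gives a direct test of whether the slopes differ.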

Exercise — CPS earnings 2014

120,434 full-time employed US workers from the 2014 Current Population Survey (CPS), Monthly Outgoing Rotation Group. Used for the in-session exercise: diagnosing a Mincer wage regression.

| Variable | Description |
|----------|-------------|
| wage | Hourly wage (USD) = weekly earnings / usual hours |
| earnwke | Weekly earnings (USD, top-coded at $2,884) |
| uhours | Usual hours worked per week |
| age | Age in years |
| educ | Educational attainment (CPS grade92 scale: 31–46) |
| female | 1 = female, 0 = male |

Source: Békés & Kézdi (2021). Full dataset documentation. Dataset included in the exercise repository — no separate download needed.
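
As a rough sketch of how the `wage` variable and a Mincer-style regression fit together: the column names below follow the variable table, but the data are simulated and the specification is only one plausible choice (an assumption, not necessarily the one used in the exercise). In particular, `educ` enters as the raw grade92 code here; recoding it to years of schooling or treating it as a factor may be more appropriate.

```r
# Simulated stand-in for the CPS extract (assumption): column names follow the
# variable table, values are made up.
set.seed(2014)
n <- 1000
cps <- data.frame(
  earnwke = pmin(runif(n, 300, 3500), 2884),  # top-coded as in the table
  uhours  = sample(35:60, n, replace = TRUE),
  age     = sample(20:64, n, replace = TRUE),
  educ    = sample(31:46, n, replace = TRUE),
  female  = rbinom(n, 1, 0.5)
)

# Hourly wage as defined in the table: weekly earnings / usual hours
cps$wage <- cps$earnwke / cps$uhours

# One possible Mincer-style specification (assumption): log wage on
# education, a quadratic in age, and a gender dummy
mincer <- lm(log(wage) ~ educ + age + I(age^2) + female, data = cps)
print(coef(mincer))
```

Because `earnwke` is top-coded, computed hourly wages for high earners are capped as well, which is worth keeping in mind when reading the residuals.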

Session outline

  • Recap of previous sessions
  • Input: OLS properties (BLUE), functional form, heteroskedasticity
  • Live demo: comparing specifications and applying robust SEs
  • Break
  • Input: multicollinearity, influential observations, normality, OVB
  • In-session exercise
  • Debrief + Quarto skill: multi-panel figures and @fig- cross-references

Materials

| File | Description |
|------|-------------|
| Slides | Lecture slides (open in browser, press F for fullscreen) |
| Live demo | Coding document built during the session: complete diagnostic workflow |
| Exercise | In-session exercise: diagnosing and correcting a regression (via GitHub Classroom) |
| Exercise solution | Example solution (added after session) |

Quarto skill introduced this session

Embedding diagnostic plots in a Quarto report: figure captions, @fig- cross-references, and multi-panel layouts using layout-ncol.

```{r}
#| label: fig-diagnostics
#| fig-cap: "Two key diagnostic plots."
#| fig-subcap:
#|   - "Scale-location plot"
#|   - "Q-Q plot"
#| layout-ncol: 2

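# Placeholder model on a built-in dataset (assumption); swap in your own fit
fit <- lm(mpg ~ wt, data = mtcars)
plot(fit, which = 3)  # plot 1: scale-location
plot(fit, which = 2)  # plot 2: Q-Q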
```

Cross-reference in text: As shown in @fig-diagnostics-1, … (the -1 suffix points to the first sub-figure; @fig-diagnostics refers to the figure as a whole).