library(tidyverse)
library(modelsummary)
library(marginaleffects)
library(broom)

Session 3: Modelling Binary and Categorical Outcomes
Session 3 · Quantitative Data Analysis
1 Introduction
1.1 Business question
Can we predict which firms are likely to exit the market in the next year?
This question is highly relevant in business practice: investors, banks, and regulators all have an interest in understanding which firms are at risk of failure. But answering it with standard regression runs into an immediate problem — the outcome we care about, firm exit, is a binary variable: a firm either exits (1) or it doesn’t (0).
In this session, we work through why ordinary least squares (OLS) breaks down for such outcomes, develop the intuition for logistic regression, and learn how to estimate, interpret, and communicate logistic regression results in R.
1.2 Packages and data
We use the bisnode-firms dataset from the open-access textbook by Békés & Kézdi (2021). The extract used in this session covers 1,916 Hungarian manufacturing and services firms. Our outcome variable is exit, a binary indicator equal to 1 if the firm closed in the following year.
firms <- read_csv("data/bisnode-firms.csv")
glimpse(firms)

Rows: 1,916
Columns: 11
$ comp_id <dbl> 1022796, 1340059, 1627539, 1885666, 2407000, 8516875, 10…
$ exit <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ sales_log <dbl> 10.169549, 10.440306, 11.355318, 9.204397, 10.806355, 9.…
$ employees_log <dbl> -2.48490662, -2.39789524, -2.48490662, -2.48490662, -2.0…
$ profit_margin <dbl> -0.050241269, -0.119844111, -0.252612413, 0.087928463, 0…
$ age <dbl> 11, 2, 1, 3, 2, 5, 12, 4, 2, 6, 2, 1, 23, 21, 1, 6, 3, 9…
$ liquidity <dbl> 0.90785164, 0.35994765, 0.48970007, 0.55894003, 0.848029…
$ foreign <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,…
$ female_ceo <dbl> 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0,…
$ industry <dbl> 56, 56, 56, 56, 56, 56, 29, 47, 56, 56, 56, 56, 56, 56, …
$ urban <dbl> 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,…
Let us first look at the outcome variable:
firms |>
count(exit) |>
mutate(share = n / sum(n))

# A tibble: 2 × 3
exit n share
<dbl> <int> <dbl>
1 0 1847 0.964
2 1 69 0.0360
Notice that exit events are relatively rare — this is typical of survival data and has implications for model evaluation that we will revisit in Session 9 (prediction vs. inference).
A quick look at the main predictors:
firms |>
select(exit, sales_log, employees_log, profit_margin, age) |>
summary()

      exit           sales_log      employees_log     profit_margin
Min. :0.00000 Min. : 1.309 Min. :-2.485 Min. :-1.0000000
1st Qu.:0.00000 1st Qu.: 9.130 1st Qu.:-2.485 1st Qu.:-0.7208847
Median :0.00000 Median :10.143 Median :-1.792 Median :-0.2424290
Mean :0.03601 Mean :10.060 Mean :-1.541 Mean :-0.3427691
3rd Qu.:0.00000 3rd Qu.:11.093 3rd Qu.:-1.011 3rd Qu.: 0.0004585
Max. :1.00000 Max. :17.850 Max. : 3.461 Max. : 1.0000000
age
Min. : 0.000
1st Qu.: 1.000
Median : 3.000
Mean : 5.872
3rd Qu.: 9.000
Max. :23.000
2 Why OLS Fails for Binary Outcomes
2.1 The Linear Probability Model
The simplest approach is to regress the binary outcome directly on predictors using OLS. This is called the Linear Probability Model (LPM):
\[\text{exit}_i = \beta_0 + \beta_1 \cdot \text{sales\_log}_i + \beta_2 \cdot \text{profit\_margin}_i + \varepsilon_i\]
lpm <- lm(exit ~ sales_log + profit_margin + employees_log,
data = firms)
summary(lpm)
Call:
lm(formula = exit ~ sales_log + profit_margin + employees_log,
data = firms)
Residuals:
Min 1Q Median 3Q Max
-0.16173 -0.04660 -0.03429 -0.02043 0.99650
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.187786 0.040547 4.631 3.88e-06 ***
sales_log -0.010614 0.003248 -3.268 0.00110 **
profit_margin 0.043991 0.011029 3.989 6.90e-05 ***
employees_log 0.019413 0.005690 3.411 0.00066 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1854 on 1912 degrees of freedom
Multiple R-squared: 0.01193, Adjusted R-squared: 0.01038
F-statistic: 7.697 on 3 and 1912 DF, p-value: 4.117e-05
The coefficient interpretation is straightforward: a one-unit increase in profit_margin is associated with a 0.044 increase in the probability of exit, i.e., 4.4 percentage points.
2.1.1 The problem: out-of-bounds predictions
lpm_fitted <- augment(lpm, data = firms)
lpm_fitted |>
summarise(
min_pred = min(.fitted),
max_pred = max(.fitted),
n_below_zero = sum(.fitted < 0),
n_above_one = sum(.fitted > 1)
)

# A tibble: 1 × 4
min_pred max_pred n_below_zero n_above_one
<dbl> <dbl> <int> <int>
1 -0.0225 0.176 49 0
We have 49 predicted values below zero (and none above one). Negative predicted values are mathematically impossible as probabilities.
ggplot(lpm_fitted, aes(x = sales_log, y = .fitted)) +
geom_point(alpha = 0.2, size = 0.8, color = "#2c7bb6") +
geom_hline(yintercept = c(0, 1), linetype = "dashed", color = "firebrick") +
annotate("label", x = max(firms$sales_log, na.rm = TRUE) - 1,
y = -0.05, label = "Impossible (P < 0)",
color = "firebrick", size = 3) +
labs(
title = "LPM: Fitted probabilities of firm exit",
x = "Log sales", y = "Predicted P(exit = 1)"
) +
theme_minimal()

2.1.2 The problem: heteroskedastic errors
When Y is binary, the variance of the error term depends on X by construction:
\[\text{Var}(\varepsilon_i | x_i) = P_i(1 - P_i)\]
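To see the scale of the problem, we can tabulate this implied variance at a few probabilities (a minimal base-R sketch; the values of `p` are illustrative, not from our model):

```r
# Implied error variance of a binary outcome: Var(e | x) = P(1 - P)
p <- c(0.05, 0.25, 0.50, 0.75, 0.95)
v <- p * (1 - p)
round(setNames(v, p), 4)
# The variance peaks at 0.25 when P = 0.5 and shrinks towards the extremes
```

By construction, then, observations with fitted probabilities near 0 or 1 have much smaller error variance than those near 0.5 — exactly the non-constant variance that OLS assumes away.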
This means OLS standard errors are not valid, so hypothesis tests and confidence intervals are unreliable. We can see the problem visually — because the outcome takes only two values, the residuals fall along two parallel bands, and their spread varies systematically with the fitted values.
ggplot(lpm_fitted, aes(x = .fitted, y = .resid)) +
geom_point(alpha = 0.2, size = 0.8) +
geom_hline(yintercept = 0, linetype = "dashed") +
labs(title = "LPM residuals vs. fitted values",
x = "Fitted value", y = "Residual") +
theme_minimal()

Despite these problems, the LPM is widely used in economics, especially with panel data. It can serve as a useful benchmark. But you should always be aware of its limitations.
3 Logistic Regression: Intuition
3.1 The logistic function
We need a function that maps any real number to a probability in [0, 1]. The logistic (sigmoid) function does exactly this:
\[\sigma(z) = \frac{e^z}{1 + e^z}\]
sigma <- function(z) exp(z) / (1 + exp(z))
tibble(z = seq(-6, 6, by = 0.05)) |>
mutate(prob = sigma(z)) |>
ggplot(aes(z, prob)) +
geom_line(color = "#1a9641", linewidth = 1.3) +
geom_hline(yintercept = c(0, 0.5, 1), linetype = "dashed", color = "grey60") +
labs(title = "The logistic function",
x = "z (linear combination of predictors)",
y = "Predicted probability") +
theme_minimal()

The idea: let \(z = \beta_0 + \beta_1 x_1 + \ldots\) be the linear combination, and then:
\[\hat{P}(\text{exit} = 1 \mid \mathbf{x}) = \sigma(z) = \frac{e^{z}}{1 + e^{z}}\]
3.2 From the sigmoid to the logit
Rearranging the expression above gives the logit link:
\[\ln\left(\frac{P}{1-P}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots\]
The left-hand side is the log-odds. The relationship between probability, odds, and log-odds:
| P(exit) | Odds | Log-odds |
|---|---|---|
| 0.10 | 0.11 | −2.20 |
| 0.25 | 0.33 | −1.10 |
| 0.50 | 1.00 | 0.00 |
| 0.75 | 3.00 | 1.10 |
| 0.90 | 9.00 | 2.20 |
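These values follow mechanically from the definitions, and base R ships both mappings: qlogis() is the logit and plogis() its inverse (the sigmoid). A quick check that reproduces the table above:

```r
# Probability -> odds -> log-odds, reproducing the table above
p        <- c(0.10, 0.25, 0.50, 0.75, 0.90)
odds     <- p / (1 - p)
log_odds <- log(odds)

# qlogis() computes the logit directly; plogis() inverts it
stopifnot(all.equal(log_odds, qlogis(p)),
          all.equal(plogis(log_odds), p))

round(cbind(p, odds, log_odds), 2)
```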
3.3 Maximum likelihood estimation
Logistic regression is not estimated by OLS but by maximum likelihood estimation (MLE). The idea: find the coefficient values β that make the observed data most probable. In practice, R’s glm() handles this for you.
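To demystify what glm() is doing, here is a sketch on simulated data (the variable names and true coefficients below are invented for illustration): we write down the logit log-likelihood ourselves, maximise it with optim(), and check that we recover essentially the same estimates glm() reports.

```r
set.seed(42)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-1 + 0.8 * x))   # true intercept -1, slope 0.8

# Negative log-likelihood of the logistic regression model
negll <- function(beta) {
  p <- plogis(beta[1] + beta[2] * x)
  -sum(y * log(p) + (1 - y) * log(1 - p))
}

fit_optim <- optim(c(0, 0), negll)          # generic optimiser
fit_glm   <- glm(y ~ x, family = binomial)  # R's built-in fit

# Both routes maximise the same likelihood, so the estimates agree
# (up to optimiser tolerance)
rbind(optim = fit_optim$par, glm = unname(coef(fit_glm)))
```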
4 Logistic Regression in R
4.1 Fitting the model
The syntax is almost identical to lm(). The key difference is family = binomial:
logit_model <- glm(exit ~ sales_log + profit_margin + employees_log,
family = binomial,
data = firms)
summary(logit_model)
Call:
glm(formula = exit ~ sales_log + profit_margin + employees_log,
family = binomial, data = firms)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.13919 0.89575 -0.155 0.87652
sales_log -0.21764 0.07365 -2.955 0.00313 **
profit_margin 1.05809 0.28546 3.707 0.00021 ***
employees_log 0.46545 0.14679 3.171 0.00152 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 594.18 on 1915 degrees of freedom
Residual deviance: 574.13 on 1912 degrees of freedom
AIC: 582.13
Number of Fisher Scoring iterations: 6
If you forget family = binomial, R will silently fit a Gaussian (OLS) model on your binary outcome. Always check your model specification.
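One way to catch the mistake is to inspect the fitted object's family. A toy illustration on simulated data (the names are ours):

```r
set.seed(7)
toy <- data.frame(x = rnorm(100))
toy$y01 <- rbinom(100, 1, plogis(0.5 * toy$x))

m_wrong <- glm(y01 ~ x, data = toy)                      # silently Gaussian
m_right <- glm(y01 ~ x, family = binomial, data = toy)

c(wrong = family(m_wrong)$family, right = family(m_right)$family)
# wrong is "gaussian", right is "binomial"
```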
4.2 Comparing LPM and logit side by side
modelsummary(
list("LPM" = lpm, "Logit" = logit_model),
stars = TRUE,
gof_map = c("nobs", "r.squared", "logLik", "AIC"),
title = "Firm exit: LPM vs. logistic regression"
)

Firm exit: LPM vs. logistic regression

|  | LPM | Logit |
|---|---|---|
| (Intercept) | 0.188*** | -0.139 |
|  | (0.041) | (0.896) |
| sales_log | -0.011** | -0.218** |
|  | (0.003) | (0.074) |
| profit_margin | 0.044*** | 1.058*** |
|  | (0.011) | (0.285) |
| employees_log | 0.019*** | 0.465** |
|  | (0.006) | (0.147) |
| Num.Obs. | 1916 | 1916 |
| R2 | 0.012 |  |
| Log.Lik. | 512.236 | -287.063 |

+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
Note: the logit model reports log-likelihood and AIC rather than R². These are the relevant goodness-of-fit statistics for maximum likelihood models.
5 Interpretation
5.1 Level 1: Log-odds (raw coefficients)
coef(logit_model)

  (Intercept)     sales_log profit_margin employees_log 
   -0.1391854    -0.2176396     1.0580916     0.4654490 
The coefficient on profit_margin is 1.058.
Interpretation: a one-unit increase in profit_margin is associated with a change of 1.058 in the log-odds of exit. This tells us the direction clearly, but the magnitude is not intuitive.
5.2 Level 2: Odds ratios
Exponentiate the coefficients to get odds ratios:
exp(coef(logit_model))

  (Intercept)     sales_log profit_margin employees_log 
    0.8700667     0.8044153     2.8808680     1.5927292 
The odds ratio for profit_margin is 2.881.
Interpretation: a one-unit increase in profit margin multiplies the odds of exit by 2.881, i.e., it increases the odds by 188.1%.
We can add confidence intervals:
exp(confint(logit_model))

                  2.5 %    97.5 %
(Intercept) 0.1367246 4.6964981
sales_log 0.6998908 0.9358074
profit_margin 1.6423593 5.0529463
employees_log 1.1880168 2.1147743
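A handy conversion when writing up odds ratios: the percentage change in the odds is \((e^\beta - 1) \times 100\). Applied to the profit_margin coefficient from above:

```r
b  <- 1.0580916               # logit coefficient on profit_margin
or <- exp(b)                  # odds ratio
pct_change <- (or - 1) * 100  # percentage change in the odds
round(c(odds_ratio = or, pct_change = pct_change), 1)
# odds_ratio approx. 2.9, pct_change approx. 188.1
```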
5.3 Level 3: Predicted probabilities (most useful)
5.3.1 For observed data
firms <- firms |>
mutate(pred_prob = predict(logit_model, type = "response"))
firms |>
group_by(exit) |>
summarise(mean_pred_prob = mean(pred_prob))

# A tibble: 2 × 2
exit mean_pred_prob
<dbl> <dbl>
1 0 0.0355
2 1 0.0490
Firms that actually exited had, on average, a predicted exit probability of 0.049, compared to 0.036 for firms that survived. The model is doing something sensible.
5.3.2 For specific scenarios
This is often the most useful form for communicating results:
scenarios <- tibble(
label = c("Profitable, large", "Loss-making, small"),
sales_log = c(11, 9),
employees_log = c(4, 2),
profit_margin = c(0.10, -0.15)
)
scenarios <- scenarios |>
mutate(
pred_prob = predict(logit_model, newdata = scenarios, type = "response")
)
scenarios

# A tibble: 2 × 5
label sales_log employees_log profit_margin pred_prob
<chr> <dbl> <dbl> <dbl> <dbl>
1 Profitable, large 11 4 0.1 0.362
2 Loss-making, small 9 2 -0.15 0.210
A profitable, large firm has a predicted exit probability of 0.362, compared to 0.21 for a small, loss-making firm.
5.3.3 Visualising predicted probabilities
# How does exit probability vary with profit margin?
grid <- tibble(
profit_margin = seq(-0.5, 0.5, by = 0.01),
sales_log = mean(firms$sales_log, na.rm = TRUE),
employees_log = mean(firms$employees_log, na.rm = TRUE)
)
grid <- grid |>
mutate(pred_prob = predict(logit_model, newdata = grid, type = "response"))
ggplot(grid, aes(x = profit_margin, y = pred_prob)) +
geom_line(color = "#1a9641", linewidth = 1.2) +
geom_hline(yintercept = c(0, 1), linetype = "dashed", color = "grey60") +
labs(
title = "Predicted probability of exit vs. profit margin",
subtitle = "Other variables held at their mean",
x = "Profit margin", y = "P(exit = 1)"
) +
theme_minimal()

5.4 Average Marginal Effects
A cleaner summary of “the effect of X on P(exit)” is the Average Marginal Effect (AME). Because logit is non-linear, marginal effects differ across observations. The AME averages them:
avg_slopes(logit_model)
Term Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 %
employees_log 0.01592 0.00524 3.04 0.00239 8.7 0.00565 0.02619
profit_margin 0.03618 0.01033 3.50 < 0.001 11.1 0.01593 0.05644
sales_log -0.00744 0.00260 -2.86 0.00426 7.9 -0.01255 -0.00234
Type: response
Comparison: dY/dX
The AME for profit_margin tells us: on average across all firms, a one-unit increase in profit margin is associated with a 0.036 change in the probability of exit. This is directly comparable to the LPM coefficient.
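What avg_slopes() computes can be reproduced by hand. For a logit model, the marginal effect of x for observation i is \(p_i(1 - p_i)\beta\), and the AME is its average over the sample. A sketch on simulated data (the variable names are ours), checked against a numerical derivative:

```r
set.seed(1)
n <- 1000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 0.7 * x))
m <- glm(y ~ x, family = binomial)

# Analytic AME: average of p_i * (1 - p_i) * beta over all observations
p   <- fitted(m)
ame <- mean(p * (1 - p)) * coef(m)[["x"]]

# Numerical check: nudge x by a small h and average the probability change
h <- 1e-5
p_shift <- predict(m, newdata = data.frame(x = x + h), type = "response")
ame_num <- mean((p_shift - p) / h)

c(analytic = ame, numeric = ame_num)
```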
6 Quarto Skill: Inline R Code
When you write a report, you want your numbers to update automatically when the data or model changes. Use inline R code:
Instead of typing "2.88" manually, write:

The odds ratio for profit margin is `r round(exp(coef(logit_model))[["profit_margin"]], 2)`.

This renders as: "The odds ratio for profit margin is 2.88."
If you re-estimate the model with different variables or data, the number in the text updates automatically. This is one of the most important professional habits in reproducible data analysis.
7 Summary
- OLS on binary outcomes produces out-of-bounds predictions and invalid standard errors
- Logistic regression uses the sigmoid function to ensure \(\hat{P} \in [0, 1]\)
- Estimation is via maximum likelihood using `glm(..., family = binomial)`
- Raw coefficients are in log-odds space — hard to interpret directly
- Odds ratios \(= e^\beta\) — easier, but still require explanation
- Predicted probabilities — most useful for business communication; compute with `predict(..., type = "response")`
- Average marginal effects via `marginaleffects::avg_slopes()` — a cleaner, directly comparable effect size
- Inline R code ensures your text and numbers stay in sync
8 Further Reading
- Békés, G. & Kézdi, G. (2021). Data Analysis for Business, Economics, and Policy. Ch. 11. Available at gabors-data-analysis.com
- James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning. Ch. 4 (Classification). Available at statlearning.com