Session 3: Modelling Binary and Categorical Outcomes

Session 3 · Quantitative Data Analysis

Author

Claudius Gräbner-Radkowitsch

Published

30 April 2026

1 Introduction

1.1 Business question

Can we predict which firms are likely to exit the market in the next year?

This question is highly relevant in business practice: investors, banks, and regulators all have an interest in understanding which firms are at risk of failure. But answering it with standard regression runs into an immediate problem — the outcome we care about, firm exit, is a binary variable: a firm either exits (1) or it doesn’t (0).

In this session, we work through why ordinary least squares (OLS) breaks down for such outcomes, develop the intuition for logistic regression, and learn how to estimate, interpret, and communicate logistic regression results in R.

1.2 Packages and data

library(tidyverse)
library(modelsummary)
library(marginaleffects)
library(broom)

We use a sample from the bisnode-firms dataset from the open-access textbook by Békés & Kézdi (2021), which covers Hungarian manufacturing and services firms observed over time; the extract we load below contains 1,916 firm observations. Our outcome variable is exit, a binary indicator equal to 1 if the firm closed in the following year.

firms <- read_csv("data/bisnode-firms.csv")
glimpse(firms)
Rows: 1,916
Columns: 11
$ comp_id       <dbl> 1022796, 1340059, 1627539, 1885666, 2407000, 8516875, 10…
$ exit          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ sales_log     <dbl> 10.169549, 10.440306, 11.355318, 9.204397, 10.806355, 9.…
$ employees_log <dbl> -2.48490662, -2.39789524, -2.48490662, -2.48490662, -2.0…
$ profit_margin <dbl> -0.050241269, -0.119844111, -0.252612413, 0.087928463, 0…
$ age           <dbl> 11, 2, 1, 3, 2, 5, 12, 4, 2, 6, 2, 1, 23, 21, 1, 6, 3, 9…
$ liquidity     <dbl> 0.90785164, 0.35994765, 0.48970007, 0.55894003, 0.848029…
$ foreign       <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,…
$ female_ceo    <dbl> 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0,…
$ industry      <dbl> 56, 56, 56, 56, 56, 56, 29, 47, 56, 56, 56, 56, 56, 56, …
$ urban         <dbl> 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,…

Let us first look at the outcome variable:

firms |>
  count(exit) |>
  mutate(share = n / sum(n))
# A tibble: 2 × 3
   exit     n  share
  <dbl> <int>  <dbl>
1     0  1847 0.964 
2     1    69 0.0360
Note

Notice that exit events are relatively rare — this is typical of survival data and has implications for model evaluation that we will revisit in Session 9 (prediction vs. inference).

A quick look at the main predictors:

firms |>
  select(exit, sales_log, employees_log, profit_margin, age) |>
  summary()
      exit           sales_log      employees_log    profit_margin       
 Min.   :0.00000   Min.   : 1.309   Min.   :-2.485   Min.   :-1.0000000  
 1st Qu.:0.00000   1st Qu.: 9.130   1st Qu.:-2.485   1st Qu.:-0.7208847  
 Median :0.00000   Median :10.143   Median :-1.792   Median :-0.2424290  
 Mean   :0.03601   Mean   :10.060   Mean   :-1.541   Mean   :-0.3427691  
 3rd Qu.:0.00000   3rd Qu.:11.093   3rd Qu.:-1.011   3rd Qu.: 0.0004585  
 Max.   :1.00000   Max.   :17.850   Max.   : 3.461   Max.   : 1.0000000  
      age        
 Min.   : 0.000  
 1st Qu.: 1.000  
 Median : 3.000  
 Mean   : 5.872  
 3rd Qu.: 9.000  
 Max.   :23.000  

2 Why OLS Fails for Binary Outcomes

2.1 The Linear Probability Model

The simplest approach is to regress the binary outcome directly on predictors using OLS. This is called the Linear Probability Model (LPM):

\[\text{exit}_i = \beta_0 + \beta_1 \cdot \text{sales\_log}_i + \beta_2 \cdot \text{profit\_margin}_i + \varepsilon_i\]

lpm <- lm(exit ~ sales_log + profit_margin + employees_log,
          data = firms)
summary(lpm)

Call:
lm(formula = exit ~ sales_log + profit_margin + employees_log, 
    data = firms)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.16173 -0.04660 -0.03429 -0.02043  0.99650 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)    0.187786   0.040547   4.631 3.88e-06 ***
sales_log     -0.010614   0.003248  -3.268  0.00110 ** 
profit_margin  0.043991   0.011029   3.989 6.90e-05 ***
employees_log  0.019413   0.005690   3.411  0.00066 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1854 on 1912 degrees of freedom
Multiple R-squared:  0.01193,   Adjusted R-squared:  0.01038 
F-statistic: 7.697 on 3 and 1912 DF,  p-value: 4.117e-05

The coefficient interpretation is straightforward: a one-unit increase in profit_margin is associated with an increase of 0.044 (4.4 percentage points) in the probability of exit.

2.1.1 The problem: out-of-bounds predictions

lpm_fitted <- augment(lpm, data = firms)

lpm_fitted |>
  summarise(
    min_pred = min(.fitted),
    max_pred = max(.fitted),
    n_below_zero = sum(.fitted < 0),
    n_above_one  = sum(.fitted > 1)
  )
# A tibble: 1 × 4
  min_pred max_pred n_below_zero n_above_one
     <dbl>    <dbl>        <int>       <int>
1  -0.0225    0.176           49           0

We have 49 predicted values below zero and none above one. These are mathematically impossible as probabilities, which must lie in [0, 1].

ggplot(lpm_fitted, aes(x = sales_log, y = .fitted)) +
  geom_point(alpha = 0.2, size = 0.8, color = "#2c7bb6") +
  geom_hline(yintercept = c(0, 1), linetype = "dashed", color = "firebrick") +
  annotate("label", x = max(firms$sales_log, na.rm = TRUE) - 1,
           y = -0.05, label = "Impossible (P < 0)",
           color = "firebrick", size = 3) +
  labs(
    title = "LPM: Fitted probabilities of firm exit",
    x = "Log sales", y = "Predicted P(exit = 1)"
  ) +
  theme_minimal()

2.1.2 The problem: heteroskedastic errors

When Y is binary, the variance of the error term depends on X by construction:

\[\text{Var}(\varepsilon_i | x_i) = P_i(1 - P_i)\]

This means the usual OLS standard errors are invalid, so hypothesis tests and confidence intervals are unreliable. We can see the problem visually: because the outcome takes only two values, the residuals fall on two bands whose spread varies systematically with the fitted value.

ggplot(lpm_fitted, aes(x = .fitted, y = .resid)) +
  geom_point(alpha = 0.2, size = 0.8) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "LPM residuals vs. fitted values",
       x = "Fitted value", y = "Residual") +
  theme_minimal()
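The variance formula can also be checked numerically. The following is a minimal sketch on simulated data (not the firms sample): we generate a binary outcome with a known linear probability, fit the LPM, and compare the empirical residual variance in two regions of x to the theoretical P(1 − P).

```r
# Simulated check: residual variance of an LPM equals P(1 - P) by construction
set.seed(42)
n <- 100000
x <- runif(n)
p <- 0.1 + 0.5 * x              # true probability, linear in x
y <- rbinom(n, 1, p)
lpm_sim <- lm(y ~ x)
res <- residuals(lpm_sim)

# Empirical residual variance vs. theoretical P(1 - P), low vs. high x
lo <- x < 0.2; hi <- x > 0.8
c(var_lo = var(res[lo]), theory_lo = mean(p[lo] * (1 - p[lo])))
c(var_hi = var(res[hi]), theory_hi = mean(p[hi] * (1 - p[hi])))
```

With these parameters the residual variance roughly doubles between the low- and high-x regions, which is exactly the pattern the residual plot shows.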

Warning

Despite these problems, the LPM is widely used in economics, especially with panel data. It can serve as a useful benchmark. But you should always be aware of its limitations.


3 Logistic Regression: Intuition

3.1 The logistic function

We need a function that maps any real number to a probability in [0, 1]. The logistic (sigmoid) function does exactly this:

\[\sigma(z) = \frac{e^z}{1 + e^z}\]

sigma <- function(z) exp(z) / (1 + exp(z))

tibble(z = seq(-6, 6, by = 0.05)) |>
  mutate(prob = sigma(z)) |>
  ggplot(aes(z, prob)) +
  geom_line(color = "#1a9641", linewidth = 1.3) +
  geom_hline(yintercept = c(0, 0.5, 1), linetype = "dashed", color = "grey60") +
  labs(title = "The logistic function",
       x = "z (linear combination of predictors)",
       y = "Predicted probability") +
  theme_minimal()

The idea: let \(z = \beta_0 + \beta_1 x_1 + \ldots\) be the linear combination, and then:

\[\hat{P}(\text{exit} = 1 \mid \mathbf{x}) = \sigma(z) = \frac{e^{z}}{1 + e^{z}}\]

3.2 From the sigmoid to the logit

Rearranging the expression above gives the logit link:

\[\ln\left(\frac{P}{1-P}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots\]

The left-hand side is the log-odds. The relationship between probability, odds, and log-odds:

 P(exit)   Odds   Log-odds
    0.10   0.11      −2.20
    0.25   0.33      −1.10
    0.50   1.00       0.00
    0.75   3.00       1.10
    0.90   9.00       2.20
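These conversions can be reproduced in base R: odds are p / (1 − p), and qlogis() is the logit (log-odds) function, the inverse of plogis().

```r
# Probability -> odds -> log-odds for the values in the table above
p <- c(0.10, 0.25, 0.50, 0.75, 0.90)
data.frame(
  prob     = p,
  odds     = round(p / (1 - p), 2),
  log_odds = round(qlogis(p), 2)   # qlogis() = log(p / (1 - p))
)
```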

3.3 Maximum likelihood estimation

Logistic regression is not estimated by OLS but by maximum likelihood estimation (MLE). The idea: find the coefficient values β that make the observed data most probable. In practice, R’s glm() handles this for you.
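To make "most probable" concrete, here is a sketch that maximises the Bernoulli log-likelihood directly with optim() on simulated data (not the firms sample) and recovers the same coefficients as glm():

```r
# MLE by hand: minimise the negative log-likelihood of a logit model
set.seed(1)
n <- 5000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-1 + 0.8 * x))   # true coefficients: -1 and 0.8

negll <- function(beta) {
  eta <- beta[1] + beta[2] * x            # linear index
  -sum(y * eta - log(1 + exp(eta)))       # Bernoulli log-likelihood, negated
}

mle <- optim(c(0, 0), negll)$par
rbind(optim = mle, glm = unname(coef(glm(y ~ x, family = binomial))))
```

Both rows agree to several decimal places: glm() is simply a fast, robust way of solving this maximisation problem.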


4 Logistic Regression in R

4.1 Fitting the model

The syntax is almost identical to lm(). The key difference is family = binomial:

logit_model <- glm(exit ~ sales_log + profit_margin + employees_log,
                   family = binomial,
                   data = firms)

summary(logit_model)

Call:
glm(formula = exit ~ sales_log + profit_margin + employees_log, 
    family = binomial, data = firms)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)   -0.13919    0.89575  -0.155  0.87652    
sales_log     -0.21764    0.07365  -2.955  0.00313 ** 
profit_margin  1.05809    0.28546   3.707  0.00021 ***
employees_log  0.46545    0.14679   3.171  0.00152 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 594.18  on 1915  degrees of freedom
Residual deviance: 574.13  on 1912  degrees of freedom
AIC: 582.13

Number of Fisher Scoring iterations: 6
Important

If you forget family = binomial, R will silently fit a Gaussian (OLS) model on your binary outcome. Always check your model specification.
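You can see this for yourself on simulated data: without the family argument, glm() defaults to gaussian() and reproduces the OLS fit exactly.

```r
# glm() without family = binomial is just OLS in disguise
set.seed(7)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(x))

coef(glm(y ~ x))                      # default: family = gaussian()
coef(lm(y ~ x))                       # identical coefficients
coef(glm(y ~ x, family = binomial))   # the logit model, on a different scale
```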

4.2 Comparing LPM and logit side by side

modelsummary(
  list("LPM" = lpm, "Logit" = logit_model),
  stars = TRUE,
  gof_map = c("nobs", "r.squared", "logLik", "AIC"),
  title = "Firm exit: LPM vs. logistic regression"
)
Firm exit: LPM vs. logistic regression
                  LPM        Logit
(Intercept)     0.188***   -0.139
               (0.041)    (0.896)
sales_log      -0.011**   -0.218**
               (0.003)    (0.074)
profit_margin   0.044***   1.058***
               (0.011)    (0.285)
employees_log   0.019***   0.465**
               (0.006)    (0.147)
Num.Obs.        1916       1916
R2              0.012
Log.Lik.      512.236   -287.063
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

Note: the logit model reports log-likelihood and AIC rather than R². These are the relevant goodness-of-fit statistics for maximum likelihood models.


5 Interpretation

5.1 Level 1: Log-odds (raw coefficients)

coef(logit_model)
  (Intercept)     sales_log profit_margin employees_log 
   -0.1391854    -0.2176396     1.0580916     0.4654490 

The coefficient on profit_margin is 1.058.

Interpretation: a one-unit increase in profit_margin is associated with an increase of 1.058 in the log-odds of exit. This tells us the direction clearly, but the magnitude is not intuitive.

5.2 Level 2: Odds ratios

Exponentiate the coefficients to get odds ratios:

exp(coef(logit_model))
  (Intercept)     sales_log profit_margin employees_log 
    0.8700667     0.8044153     2.8808680     1.5927292 

The odds ratio for profit_margin is 2.881.

Interpretation: a one-unit increase in profit margin multiplies the odds of exit by 2.881, i.e., it increases the odds by 188.1%.
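The percentage interpretation follows mechanically from the odds ratio. Using the profit_margin coefficient printed above:

```r
b <- 1.0580916    # profit_margin coefficient (log-odds scale)
or <- exp(b)      # odds ratio: 2.881
(or - 1) * 100    # percent change in the odds: +188.1%
```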

We can add confidence intervals:

exp(confint(logit_model))
                  2.5 %    97.5 %
(Intercept)   0.1367246 4.6964981
sales_log     0.6998908 0.9358074
profit_margin 1.6423593 5.0529463
employees_log 1.1880168 2.1147743

5.3 Level 3: Predicted probabilities (most useful)

5.3.1 For observed data

firms <- firms |>
  mutate(pred_prob = predict(logit_model, type = "response"))

firms |>
  group_by(exit) |>
  summarise(mean_pred_prob = mean(pred_prob))
# A tibble: 2 × 2
   exit mean_pred_prob
  <dbl>          <dbl>
1     0         0.0355
2     1         0.0490

Firms that actually exited had, on average, a predicted exit probability of 0.049, compared to 0.036 for firms that survived. The model is doing something sensible.

5.3.2 For specific scenarios

This is often the most useful form for communicating results:

scenarios <- tibble(
  label         = c("Profitable, large", "Loss-making, small"),
  sales_log     = c(11, 9),
  employees_log = c(4, 2),
  profit_margin = c(0.10, -0.15)
)

scenarios <- scenarios |>
  mutate(
    pred_prob = predict(logit_model, newdata = scenarios, type = "response")
  )

scenarios
# A tibble: 2 × 5
  label              sales_log employees_log profit_margin pred_prob
  <chr>                  <dbl>         <dbl>         <dbl>     <dbl>
1 Profitable, large         11             4          0.1      0.362
2 Loss-making, small         9             2         -0.15     0.210

A profitable, large firm has a predicted exit probability of 0.362, compared to 0.210 for a small, loss-making firm. Note that the larger, profitable firm has the higher predicted risk — a direct consequence of the positive coefficients on profit_margin and employees_log — and that employees_log = 4 lies above the maximum observed in the sample, so these scenarios illustrate the mechanics rather than a substantive finding.
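These predictions can be reproduced by hand from the coefficient table in Section 4.1: build the linear index z and apply plogis(), base R's logistic function.

```r
# Coefficients from summary(logit_model) above
b0 <- -0.13919; b_sales <- -0.21764
b_profit <- 1.05809; b_empl <- 0.46545

# Scenario "Profitable, large": sales_log = 11, employees_log = 4, profit_margin = 0.10
z <- b0 + b_sales * 11 + b_profit * 0.10 + b_empl * 4
plogis(z)   # 0.362, matching predict(..., type = "response")
```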

5.3.3 Visualising predicted probabilities

# How does exit probability vary with profit margin?
grid <- tibble(
  profit_margin = seq(-0.5, 0.5, by = 0.01),
  sales_log     = mean(firms$sales_log, na.rm = TRUE),
  employees_log = mean(firms$employees_log, na.rm = TRUE)
)

grid <- grid |>
  mutate(pred_prob = predict(logit_model, newdata = grid, type = "response"))

ggplot(grid, aes(x = profit_margin, y = pred_prob)) +
  geom_line(color = "#1a9641", linewidth = 1.2) +
  geom_hline(yintercept = c(0, 1), linetype = "dashed", color = "grey60") +
  labs(
    title = "Predicted probability of exit vs. profit margin",
    subtitle = "Other variables held at their mean",
    x = "Profit margin", y = "P(exit = 1)"
  ) +
  theme_minimal()

5.4 Average Marginal Effects

A cleaner summary of “the effect of X on P(exit)” is the Average Marginal Effect (AME). Because logit is non-linear, marginal effects differ across observations. The AME averages them:

avg_slopes(logit_model)

          Term Estimate Std. Error     z Pr(>|z|)    S    2.5 %   97.5 %
 employees_log  0.01592    0.00524  3.04  0.00239  8.7  0.00565  0.02619
 profit_margin  0.03618    0.01033  3.50  < 0.001 11.1  0.01593  0.05644
 sales_log     -0.00744    0.00260 -2.86  0.00426  7.9 -0.01255 -0.00234

Type: response
Comparison: dY/dX

The AME for profit_margin tells us: on average across all firms, a one-unit increase in profit margin is associated with an increase of 0.036 (3.6 percentage points) in the probability of exit. This is directly comparable to the LPM coefficient of 0.044.
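For a continuous predictor in a logit model, the marginal effect at observation i is β · p̂ᵢ(1 − p̂ᵢ), so the AME is just the average of these quantities. A sketch on simulated data (not the firms sample); avg_slopes() computes the equivalent, plus delta-method standard errors:

```r
# AME by hand: average of beta * p(1 - p) across observations
set.seed(3)
n <- 4000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-2 + 0.5 * x))
m <- glm(y ~ x, family = binomial)

p_hat <- fitted(m)                               # predicted probabilities
ame <- mean(coef(m)["x"] * p_hat * (1 - p_hat))  # average marginal effect of x
ame
```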


6 Quarto Skill: Inline R Code

When you write a report, you want your numbers to update automatically when the data or model changes. Use inline R code:

Instead of typing "2.88" manually, write in the .qmd source:

The odds ratio for profit margin is `r round(exp(coef(logit_model))["profit_margin"], 2)`.

This renders as: "The odds ratio for profit margin is 2.88."

If you re-estimate the model with different variables or data, the number in the text updates automatically. This is one of the most important professional habits in reproducible data analysis.


7 Summary

  • OLS on binary outcomes produces out-of-bounds predictions and invalid standard errors
  • Logistic regression uses the sigmoid function to ensure \(\hat{P} \in [0, 1]\)
  • Estimation is via maximum likelihood using glm(..., family = binomial)
  • Raw coefficients are in log-odds space — hard to interpret directly
  • Odds ratios \(= e^\beta\) — easier, but still require explanation
  • Predicted probabilities — most useful for business communication; compute with predict(..., type = "response")
  • Average marginal effects via marginaleffects::avg_slopes() — a cleaner, directly comparable effect size
  • Inline R code ensures your text and numbers stay in sync

8 Further Reading

  • Békés, G. & Kézdi, G. (2021). Data Analysis for Business, Economics, and Policy. Ch. 11. Available at gabors-data-analysis.com
  • James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning. Ch. 4 (Classification). Available at statlearning.com