Exercise: What drives hotel prices in Vienna?

Session 2 — Multiple Regression: Going Beyond the Basics

Author

Your Name (Student ID)

Published

09 04 2026

library(tidyverse)
library(modelsummary)
library(flextable)

Business Question

You work as an analyst for a hotel booking platform. Your manager asks:

“We know hotels further from the city centre tend to be cheaper. But does that penalty look the same for a budget three-star as for a luxury five-star property? And what else explains price differences once we account for location?”

In this exercise you will use multiple regression — including a categorical variable and an interaction term — to answer this question.

Data

Source: Békés & Kézdi (2021), Data Analysis for Business, Economics, and Policy. Hotels in Vienna for one date.

Download the dataset from OSF: https://osf.io/4e6d8/files/y6jvb/

The file you need is hotels-vienna.csv. Place it in the data/ subfolder next to this document.

hotels <- read_csv("data/hotels-vienna.csv")  |> filter(distance>0)

Explore the data

glimpse(hotels)
Rows: 427
Columns: 24
$ country            <chr> "Austria", "Austria", "Austria", "Austria", "Austri…
$ city_actual        <chr> "Vienna", "Vienna", "Vienna", "Vienna", "Vienna", "…
$ rating_count       <dbl> 36, 189, 53, 55, 33, 25, 57, 161, 50, NA, 203, 251,…
$ center1label       <chr> "City centre", "City centre", "City centre", "City …
$ center2label       <chr> "Donauturm", "Donauturm", "Donauturm", "Donauturm",…
$ neighbourhood      <chr> "17. Hernals", "17. Hernals", "Alsergrund", "Alserg…
$ price              <dbl> 81, 81, 85, 83, 82, 229, 103, 150, 80, 153, 60, 128…
$ city               <chr> "Vienna", "Vienna", "Vienna", "Vienna", "Vienna", "…
$ stars              <dbl> 4.0, 4.0, 4.0, 3.0, 4.0, 5.0, 4.0, 4.0, 2.0, 3.0, 4…
$ ratingta           <dbl> 4.5, 3.5, 3.5, 4.0, 3.5, 4.5, 3.5, 4.5, 3.5, NA, 4.…
$ ratingta_count     <dbl> 216, 708, 629, 52, 219, 27, 251, 617, 146, NA, 359,…
$ scarce_room        <dbl> 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, …
$ hotel_id           <dbl> 21894, 21897, 21901, 21902, 21903, 21904, 21906, 21…
$ offer              <dbl> 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, …
$ offer_cat          <chr> "15-50% offer", "1-15% offer", "15-50% offer", "15-…
$ year               <dbl> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 201…
$ month              <dbl> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11,…
$ weekend            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ holiday            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ distance           <dbl> 2.7, 1.7, 1.4, 1.7, 1.2, 0.9, 0.9, 1.0, 0.7, 1.5, 1…
$ distance_alter     <dbl> 4.4, 3.8, 2.5, 2.5, 2.8, 3.0, 2.4, 2.7, 2.7, 2.7, 2…
$ accommodation_type <chr> "Apartment", "Hotel", "Hotel", "Hotel", "Hotel", "A…
$ nnights            <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ rating             <dbl> 4.4, 3.9, 3.7, 4.0, 3.9, 4.8, 3.9, 4.6, 3.5, NA, 4.…

Task 1.1 In 2–3 sentences, describe what one observation in this dataset represents. List the variables you plan to use and state whether each is numeric or categorical.

Task 1.2 Produce simple summary statistics (mean, standard deviation, min, max) for price, distance, and stars. Comment briefly: what is the range of prices? How far from the centre do hotels typically sit?

Table 1: Descriptive statistics: Vienna hotels
# Hint: compute summary statistics with summarise(), build a small data frame
# with one row per variable, then format it with flextable() — the same
# workflow you used in Task 1.
# Give the columns readable names: "Variable", "Mean", "Std. Dev.", "Min", "Max".

# YOUR CODE HERE

Data Preparation

For a focused analysis we restrict the sample to hotels with 3, 4, or 5 stars and drop any observations with missing prices or distances.

hotels_clean <- hotels |>
  filter(stars %in% c(3, 3.5, 4, 4.5, 5)) |>
  filter(!is.na(price), !is.na(distance)) |>
  mutate(
    # Create a three-category star variable — budget, mid-range, luxury
    star_cat = case_when(
      stars <= 3   ~ "3-star",
      stars <= 4   ~ "4-star",
      stars >= 4.5 ~ "5-star"
    ),
    # Set 3-star as the reference category
    star_cat = factor(star_cat, levels = c("3-star", "4-star", "5-star"))
  )

Task 2.1 How many observations remain after filtering? How many hotels fall in each star_cat group? Use count() to find out and state the result in one sentence.

# YOUR CODE HERE

Visualise the key relationship

Task 2.2 Create a scatter plot of price (y-axis) against distance (x-axis). If necessary, transform the variables. Colour the points by star_cat and add a separate linear regression line for each category using geom_smooth(method = "lm", se = FALSE). Give the plot informative axis labels and a title.

In 2–3 sentences: do the regression lines run roughly parallel, or do they differ in slope? What does this suggest about the relationship between distance, star category, and price?

# YOUR CODE HERE

Regression Models

Estimating the models

Based on the scatter plot, use log-transformed price and distance in all models. In a log-log model, coefficients are elasticities: a 1% increase in distance is associated with a β% change in price.

  • Model 1: Just regress the log of price on the log of distance
  • Model 2: Add stars as additional explanatory variable
  • Model 3: Use star_cat instead of stars additional explanatory variable
  • Model 4: Extend model 3 to include interaction effects
# m1 <- lm(log(price) ~ ..., data = hotels_clean)
# m2 <- lm(..., data = hotels_clean)
# m3 <- lm(..., data = hotels_clean)
# m4 <- lm(..., data = hotels_clean)

Model Comparison

Produce a single modelsummary table containing all four models. Use coef_map to rename the row labels, and add a notes argument describing the sample and transformations.

Table 2: Drivers of hotel prices in Vienna: regression results (log-log)
coef_labels <- c(
  "(Intercept)"                  = "Constant",
  "log(distance)"                = "log(Distance from centre)",
  "stars"                        = "Star rating (continuous)",
  "star_cat4-star"               = "4-star hotel",
  "star_cat5-star"               = "5-star hotel",
  "log(distance):star_cat4-star" = "log(Distance) × 4-star",
  "log(distance):star_cat5-star" = "log(Distance) × 5-star"
)

# modelsummary(
#   list(
#       ADD HERE
#   ),
#   coef_map = coef_labels,
#   gof_map  = c("nobs", "r.squared", "adj.r.squared"),
#   stars    = TRUE,
#   notes    = ""
# )

Interpretation

Use Table 2 to answer the following questions.

Task 3.1 What does the log(distance) coefficient in column (1) tell you? Why is a log-log specification more appropriate here than a linear model?

Task 3.2 In column (2), interpret the stars coefficient. What does this model assume about the price step from 3- to 4-star compared with 4- to 5-star? Is that assumption plausible?

Task 3.3 Answer the following using column (3):

  1. What is the reference category?
  2. Interpret the 4-star hotel coefficient. (Hint: in a log-price model, the percentage premium is (exp(β) − 1) × 100.)
  3. How does model (3) differ from model (2) in what it allows the data to tell you?

Task 3.4 Answer the following using column (4):

  1. Interpret the log(Distance) × 4-star coefficient. What does it say about how the distance penalty differs for 4-star hotels?
  2. What is the total distance elasticity for a 5-star hotel? Show the calculation.
  3. Does this match what you saw in the scatter plot?

Task 3.5 Looking at the fit statistics at the bottom of the table:

  1. Does \(R^2\) increase as you add more variables? What does the adjusted \(R^2\) tell you that the plain \(R^2\) does not?
  2. Does the log(distance) coefficient change substantially between columns (1), (3), and (4)? What does stability or instability suggest?

Communication

Task 4 Write a short paragraph (80–120 words) for your manager summarising the findings. Do not use statistical jargon — no p-values, no “coefficients”. Focus on the business insight: What drives hotel prices in Vienna? Does being far from the centre hurt all hotels equally? What practical implication does this have for a hotel trying to set its prices?