library(tidyverse)
library(modelsummary)
library(flextable)Exercise: What drives hotel prices in Vienna?
Session 2 — Multiple Regression: Going Beyond the Basics
Business Question
You work as an analyst for a hotel booking platform. Your manager asks:
“We know hotels further from the city centre tend to be cheaper. But does that penalty look the same for a budget three-star as for a luxury five-star property? And what else explains price differences once we account for location?”
In this exercise you will use multiple regression — including a categorical variable and an interaction term — to answer this question.
Data
Source: Békés & Kézdi (2021), Data Analysis for Business, Economics, and Policy. Hotels in Vienna for one date.
Download the dataset from OSF: https://osf.io/4e6d8/files/y6jvb/
The file you need is hotels-vienna.csv. Place it in the data/ subfolder next to this document.
hotels <- read_csv("data/hotels-vienna.csv") |> filter(distance>0)Explore the data
glimpse(hotels)Rows: 427
Columns: 24
$ country <chr> "Austria", "Austria", "Austria", "Austria", "Austri…
$ city_actual <chr> "Vienna", "Vienna", "Vienna", "Vienna", "Vienna", "…
$ rating_count <dbl> 36, 189, 53, 55, 33, 25, 57, 161, 50, NA, 203, 251,…
$ center1label <chr> "City centre", "City centre", "City centre", "City …
$ center2label <chr> "Donauturm", "Donauturm", "Donauturm", "Donauturm",…
$ neighbourhood <chr> "17. Hernals", "17. Hernals", "Alsergrund", "Alserg…
$ price <dbl> 81, 81, 85, 83, 82, 229, 103, 150, 80, 153, 60, 128…
$ city <chr> "Vienna", "Vienna", "Vienna", "Vienna", "Vienna", "…
$ stars <dbl> 4.0, 4.0, 4.0, 3.0, 4.0, 5.0, 4.0, 4.0, 2.0, 3.0, 4…
$ ratingta <dbl> 4.5, 3.5, 3.5, 4.0, 3.5, 4.5, 3.5, 4.5, 3.5, NA, 4.…
$ ratingta_count <dbl> 216, 708, 629, 52, 219, 27, 251, 617, 146, NA, 359,…
$ scarce_room <dbl> 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, …
$ hotel_id <dbl> 21894, 21897, 21901, 21902, 21903, 21904, 21906, 21…
$ offer <dbl> 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, …
$ offer_cat <chr> "15-50% offer", "1-15% offer", "15-50% offer", "15-…
$ year <dbl> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 201…
$ month <dbl> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11,…
$ weekend <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ holiday <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ distance <dbl> 2.7, 1.7, 1.4, 1.7, 1.2, 0.9, 0.9, 1.0, 0.7, 1.5, 1…
$ distance_alter <dbl> 4.4, 3.8, 2.5, 2.5, 2.8, 3.0, 2.4, 2.7, 2.7, 2.7, 2…
$ accommodation_type <chr> "Apartment", "Hotel", "Hotel", "Hotel", "Hotel", "A…
$ nnights <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ rating <dbl> 4.4, 3.9, 3.7, 4.0, 3.9, 4.8, 3.9, 4.6, 3.5, NA, 4.…
Task 1.1 In 2–3 sentences, describe what one observation in this dataset represents. List the variables you plan to use and state whether each is numeric or categorical.
Task 1.2 Produce simple summary statistics (mean, standard deviation, min, max) for price, distance, and stars. Comment briefly: what is the range of prices? How far from the centre do hotels typically sit?
# Hint: compute summary statistics with summarise(), build a small data frame
# with one row per variable, then format it with flextable() — the same
# workflow you used in Task 1.
# Give the columns readable names: "Variable", "Mean", "Std. Dev.", "Min", "Max".
# YOUR CODE HEREData Preparation
For a focused analysis we restrict the sample to hotels with 3, 4, or 5 stars and drop any observations with missing prices or distances.
hotels_clean <- hotels |>
filter(stars %in% c(3, 3.5, 4, 4.5, 5)) |>
filter(!is.na(price), !is.na(distance)) |>
mutate(
# Create a three-category star variable — budget, mid-range, luxury
star_cat = case_when(
stars <= 3 ~ "3-star",
stars <= 4 ~ "4-star",
stars >= 4.5 ~ "5-star"
),
# Set 3-star as the reference category
star_cat = factor(star_cat, levels = c("3-star", "4-star", "5-star"))
)Task 2.1 How many observations remain after filtering? How many hotels fall in each star_cat group? Use count() to find out and state the result in one sentence.
# YOUR CODE HEREVisualise the key relationship
Task 2.2 Create a scatter plot of price (y-axis) against distance (x-axis). If necessary, transform the variables. Colour the points by star_cat and add a separate linear regression line for each category using geom_smooth(method = "lm", se = FALSE). Give the plot informative axis labels and a title.
In 2–3 sentences: do the regression lines run roughly parallel, or do they differ in slope? What does this suggest about the relationship between distance, star category, and price?
# YOUR CODE HERERegression Models
Estimating the models
Based on the scatter plot, use log-transformed price and distance in all models. In a log-log model, coefficients are elasticities: a 1% increase in distance is associated with a β% change in price.
- Model 1: Just regress the log of
priceon the log ofdistance - Model 2: Add
starsas additional explanatory variable - Model 3: Use
star_catinstead ofstarsadditional explanatory variable - Model 4: Extend model 3 to include interaction effects
# m1 <- lm(log(price) ~ ..., data = hotels_clean)# m2 <- lm(..., data = hotels_clean)# m3 <- lm(..., data = hotels_clean)# m4 <- lm(..., data = hotels_clean)Model Comparison
Produce a single modelsummary table containing all four models. Use coef_map to rename the row labels, and add a notes argument describing the sample and transformations.
coef_labels <- c(
"(Intercept)" = "Constant",
"log(distance)" = "log(Distance from centre)",
"stars" = "Star rating (continuous)",
"star_cat4-star" = "4-star hotel",
"star_cat5-star" = "5-star hotel",
"log(distance):star_cat4-star" = "log(Distance) × 4-star",
"log(distance):star_cat5-star" = "log(Distance) × 5-star"
)
# modelsummary(
# list(
# ADD HERE
# ),
# coef_map = coef_labels,
# gof_map = c("nobs", "r.squared", "adj.r.squared"),
# stars = TRUE,
# notes = ""
# )Interpretation
Use Table 2 to answer the following questions.
Task 3.1 What does the log(distance) coefficient in column (1) tell you? Why is a log-log specification more appropriate here than a linear model?
Task 3.2 In column (2), interpret the stars coefficient. What does this model assume about the price step from 3- to 4-star compared with 4- to 5-star? Is that assumption plausible?
Task 3.3 Answer the following using column (3):
- What is the reference category?
- Interpret the
4-star hotelcoefficient. (Hint: in a log-price model, the percentage premium is(exp(β) − 1) × 100.) - How does model (3) differ from model (2) in what it allows the data to tell you?
Task 3.4 Answer the following using column (4):
- Interpret the
log(Distance) × 4-starcoefficient. What does it say about how the distance penalty differs for 4-star hotels? - What is the total distance elasticity for a 5-star hotel? Show the calculation.
- Does this match what you saw in the scatter plot?
Task 3.5 Looking at the fit statistics at the bottom of the table:
- Does \(R^2\) increase as you add more variables? What does the adjusted \(R^2\) tell you that the plain \(R^2\) does not?
- Does the
log(distance)coefficient change substantially between columns (1), (3), and (4)? What does stability or instability suggest?
Communication
Task 4 Write a short paragraph (80–120 words) for your manager summarising the findings. Do not use statistical jargon — no p-values, no “coefficients”. Focus on the business insight: What drives hotel prices in Vienna? Does being far from the centre hurt all hotels equally? What practical implication does this have for a hotel trying to set its prices?