library(tidyverse)
library(modelsummary)
library(flextable)Exercise Solution: What drives hotel prices in Vienna?
Session 2 — Multiple Regression: Going Beyond the Basics
Business Question
You work as an analyst for a hotel booking platform. Your manager asks:
“We know hotels further from the city centre tend to be cheaper. But does that penalty look the same for a budget three-star as for a luxury five-star property? And what else explains price differences once we account for location?”
Data
hotels <- read_csv("data/hotels-vienna.csv") |> filter(distance>0)Explore the data
glimpse(hotels)Rows: 427
Columns: 24
$ country <chr> "Austria", "Austria", "Austria", "Austria", "Austri…
$ city_actual <chr> "Vienna", "Vienna", "Vienna", "Vienna", "Vienna", "…
$ rating_count <dbl> 36, 189, 53, 55, 33, 25, 57, 161, 50, NA, 203, 251,…
$ center1label <chr> "City centre", "City centre", "City centre", "City …
$ center2label <chr> "Donauturm", "Donauturm", "Donauturm", "Donauturm",…
$ neighbourhood <chr> "17. Hernals", "17. Hernals", "Alsergrund", "Alserg…
$ price <dbl> 81, 81, 85, 83, 82, 229, 103, 150, 80, 153, 60, 128…
$ city <chr> "Vienna", "Vienna", "Vienna", "Vienna", "Vienna", "…
$ stars <dbl> 4.0, 4.0, 4.0, 3.0, 4.0, 5.0, 4.0, 4.0, 2.0, 3.0, 4…
$ ratingta <dbl> 4.5, 3.5, 3.5, 4.0, 3.5, 4.5, 3.5, 4.5, 3.5, NA, 4.…
$ ratingta_count <dbl> 216, 708, 629, 52, 219, 27, 251, 617, 146, NA, 359,…
$ scarce_room <dbl> 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, …
$ hotel_id <dbl> 21894, 21897, 21901, 21902, 21903, 21904, 21906, 21…
$ offer <dbl> 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, …
$ offer_cat <chr> "15-50% offer", "1-15% offer", "15-50% offer", "15-…
$ year <dbl> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 201…
$ month <dbl> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11,…
$ weekend <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ holiday <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ distance <dbl> 2.7, 1.7, 1.4, 1.7, 1.2, 0.9, 0.9, 1.0, 0.7, 1.5, 1…
$ distance_alter <dbl> 4.4, 3.8, 2.5, 2.5, 2.8, 3.0, 2.4, 2.7, 2.7, 2.7, 2…
$ accommodation_type <chr> "Apartment", "Hotel", "Hotel", "Hotel", "Hotel", "A…
$ nnights <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ rating <dbl> 4.4, 3.9, 3.7, 4.0, 3.9, 4.8, 3.9, 4.6, 3.5, NA, 4.…
Task 1.1 — sample answer
Each observation is one hotel in Vienna listed on a booking platform for a specific date. The key variables for this analysis are price (nightly rate in EUR, numeric), distance (distance from the city centre in km, numeric), and stars (star rating, numeric — we will convert this to a factor later).
Task 1.2
# Build a one-row-per-variable summary data frame
desc_stats <- tibble(
Variable = c("Price (EUR)", "Distance from centre (km)", "Star rating"),
Mean = c(mean(hotels$price, na.rm = TRUE),
mean(hotels$distance, na.rm = TRUE),
mean(hotels$stars, na.rm = TRUE)),
`Std. Dev.` = c(sd(hotels$price, na.rm = TRUE),
sd(hotels$distance, na.rm = TRUE),
sd(hotels$stars, na.rm = TRUE)),
Min = c(min(hotels$price, na.rm = TRUE),
min(hotels$distance, na.rm = TRUE),
min(hotels$stars, na.rm = TRUE)),
Max = c(max(hotels$price, na.rm = TRUE),
max(hotels$distance, na.rm = TRUE),
max(hotels$stars, na.rm = TRUE))
) |>
mutate(across(where(is.numeric), \(x) round(x, 2)))
min_price <- round(min(hotels$price, na.rm = TRUE))
max_price <- round(max(hotels$price, na.rm = TRUE))
mean_price <- round(mean(hotels$price, na.rm = TRUE))
avg_stars <- round(mean(hotels$stars, na.rm = TRUE), 1)
median_dist <- round(median(hotels$distance, na.rm = TRUE), 1)
max_dist <- round(max(hotels$distance, na.rm = TRUE), 1)
flextable(desc_stats) |>
set_table_properties(width = 0.9, layout = "autofit") |>
bold(part = "header")Variable | Mean | Std. Dev. | Min | Max |
|---|---|---|---|---|
Price (EUR) | 131.11 | 91.53 | 27.0 | 1,012 |
Distance from centre (km) | 1.66 | 1.60 | 0.1 | 13 |
Star rating | 3.43 | 0.77 | 1.0 | 5 |
Sample comment: Nightly prices range widely — from about 27 EUR to about 1012 EUR — with an average around 131 EUR. The median distance from the city centre is 1.3 km, though the furthest hotel in the sample sits 13 km away. The average star rating is 3.4.
Data Preparation
hotels_clean <- hotels |>
filter(stars %in% c(3, 3.5, 4, 4.5, 5)) |>
filter(!is.na(price), !is.na(distance)) |>
mutate(
star_cat = case_when(
stars <= 3 ~ "3-star",
stars <= 4 ~ "4-star",
stars >= 4.5 ~ "5-star"
),
star_cat = factor(star_cat, levels = c("3-star", "4-star", "5-star"))
)Task 2.1
hotels_clean |> count(star_cat) |> flextable() |> theme_vanilla()star_cat | n |
|---|---|
3-star | 140 |
4-star | 199 |
5-star | 35 |
Sample answer: After filtering, 374` observations remain. The majority are 3- and 4-star hotels; 5-star properties are the smallest group (see Table 2).
Visualise the key relationship
Task 2.2
ggplot(hotels_clean, aes(x = distance, y = price, colour = star_cat)) +
geom_point(alpha = 0.4, size = 1.5) +
geom_smooth(method = "lm", se = FALSE) +
scale_colour_manual(
values = c("3-star" = "#69AACD", "4-star" = "#00395B", "5-star" = "#C8102E")
) +
labs(
x = "Distance from city centre (km)",
y = "Price per night (EUR)",
colour = "Star category",
title = "Does location matter equally for all hotel tiers?"
) +
theme_minimal()From this figure we see that the relationship is not linear, but exponential, so we should use logs:
ggplot(hotels_clean, aes(x = log(distance), y = log(price), colour = star_cat)) +
geom_point(alpha = 0.4, size = 1.5) +
geom_smooth(method = "lm", se = FALSE) +
scale_colour_manual(
values = c("3-star" = "#69AACD", "4-star" = "#00395B", "5-star" = "#C8102E")
) +
labs(
x = "Distance from city centre (log(km))",
y = "Price per night log((EUR))",
colour = "Star category",
title = "Does location matter equally for all hotel tiers?"
) +
theme_minimal()Sample comment: The three regression lines are not perfectly parallel. The 5-star line appears somewhat steeper (falling more steeply with distance) than the 3-star line, with 4-star hotels having an intermediate position. This visual pattern suggests that a distance penalty may be more pronounced for luxury hotels — exactly the business question we are investigating. The interaction model below will quantify this and help us to test statistically whether the differences in the slope are statistically significant.
Regression Models
Estimating the models
All models use log-transformed price and distance, motivated by the non-linear pattern visible in the scatter plot. In a log-log model, coefficients are interpreted as elasticities: a 1% increase in distance is associated with a β% change in price.
m1 <- lm(log(price) ~ log(distance), data = hotels_clean)m2 <- lm(log(price) ~ log(distance) + stars, data = hotels_clean)m3 <- lm(log(price) ~ log(distance) + star_cat, data = hotels_clean)m4 <- lm(log(price) ~ log(distance) * star_cat, data = hotels_clean)Model Comparison
coef_labels <- c(
"(Intercept)" = "Constant",
"log(distance)" = "log(Distance from centre)",
"stars" = "Star rating (continuous)",
"star_cat4-star" = "4-star hotel",
"star_cat5-star" = "5-star hotel",
"log(distance):star_cat4-star" = "log(Distance) × 4-star",
"log(distance):star_cat5-star" = "log(Distance) × 5-star"
)
modelsummary(
list(
"(1) Distance" = m1,
"(2) Stars (numeric)" = m2,
"(3) Stars (factor)" = m3,
"(4) Interaction" = m4
),
coef_map = coef_labels,
gof_map = c("nobs", "r.squared", "adj.r.squared"),
stars = TRUE,
notes = "Hotels-Vienna dataset. Békés & Kézdi (2021). Hotels with 3–5 stars. Outcome: log(price). Distance enters as log(distance)."
)| (1) Distance | (2) Stars (numeric) | (3) Stars (factor) | (4) Interaction | |
|---|---|---|---|---|
| + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001 | ||||
| Hotels-Vienna dataset. Békés & Kézdi (2021). Hotels with 3–5 stars. Outcome: log(price). Distance enters as log(distance). | ||||
| Constant | 4.785*** | 3.631*** | 4.619*** | 4.590*** |
| (0.022) | (0.126) | (0.032) | (0.034) | |
| log(Distance from centre) | -0.206*** | -0.153*** | -0.151*** | -0.075* |
| (0.022) | (0.021) | (0.020) | (0.036) | |
| Star rating (continuous) | 0.317*** | |||
| (0.034) | ||||
| 4-star hotel | 0.171*** | 0.199*** | ||
| (0.042) | (0.042) | |||
| 5-star hotel | 0.766*** | 0.717*** | ||
| (0.072) | (0.082) | |||
| log(Distance) × 4-star | -0.099* | |||
| (0.044) | ||||
| log(Distance) × 5-star | -0.217** | |||
| (0.083) | ||||
| Num.Obs. | 374 | 374 | 374 | 374 |
| R2 | 0.190 | 0.342 | 0.380 | 0.394 |
| R2 Adj. | 0.188 | 0.339 | 0.375 | 0.386 |
Interpretation
Task 3.1 — sample answer
Column (1) in Table 3 shows the baseline model. The log(distance) coefficient is the price elasticity of distance: a 1% increase in distance from the centre is associated with approximately β% lower price. The log-log specification is motivated by the scatter plot, which showed that the price penalty flattens out at larger distances rather than declining at a constant euro rate.
Task 3.2 — sample answer
In column (2), the stars coefficient gives the approximate percentage price premium per additional star (exact: (exp(β) − 1) × 100%). Crucially, this model constrains the price step from 3- to 4-star to be exactly the same as the step from 4- to 5-star. Comparing the 4-star and 5-star coefficients in column (3) reveals whether that constraint is warranted.
Task 3.3 — sample answer
- The reference category is 3-star — all star coefficients in columns (3) and (4) are measured relative to 3-star hotels at the same distance.
- The
4-star hotelcoefficient in column (3) is the difference in log price between a 4-star and a 3-star hotel. If the coefficient is 0.35, a 4-star hotel charges approximately(exp(0.35) − 1) × 100 ≈ 42%more than a comparable 3-star property. - Model (3) lets the 3→4-star and 4→5-star steps differ freely. If the 5-star premium is proportionally much larger than the 3-to-4-star step — as the coefficients in column (3) typically show — Model (2) would have misrepresented both steps by forcing them equal.
Task 3.4 — sample answer
log(Distance) × 4-starin column (4) is the difference in distance elasticity between 4-star and 3-star hotels. If the base elasticity is −0.12 and the interaction is −0.08, the total elasticity for 4-star hotels is −0.20: a 10% increase in distance reduces a 4-star price by about 2%, versus about 1.2% for a 3-star hotel.- The total distance elasticity for a 5-star hotel =
log(Distance from centre)+log(Distance) × 5-starfrom column (4). Sum those two values. - The log-scale scatter plot showed visibly steeper slopes for higher-star tiers, and the negative interaction coefficients in column (4) confirm this quantitatively.
Task 4.1 — sample answer
- R² increases from column (1) to (4), as adding variables always raises it. Adjusted R² penalises complexity: if it still rises from (3) to (4), the interaction terms are genuinely improving fit rather than just absorbing noise.
- The
log(distance)coefficient shifts between columns (1), (3), and (4). In column (1) it conflates the direct distance effect with the fact that cheaper hotels tend to be further out; controlling for star tier in (3) isolates the pure location penalty. The further shift in (4) reflects that once slopes differ by tier, the average distance effect is estimated differently again.
Communication
Task 5 — sample paragraph
In Vienna, hotel prices decline with distance from the city centre, but not at the same rate for all hotels. Budget three-star properties are relatively insensitive to location — moving a few kilometres further out barely affects their price. Luxury five-star hotels, by contrast, face a steeper penalty: their prices drop more sharply the further they sit from the centre, suggesting that guests paying premium rates place a higher value on central access. Star tier itself is the strongest driver of price: a five-star hotel commands a substantially higher rate than a comparable three-star property regardless of location. For a luxury operator, a central location is therefore not just a marketing asset — it is a direct contributor to pricing power.