Exercise Solution: What drives hotel prices in Vienna?

Session 2 — Multiple Regression: Going Beyond the Basics

Author

Affiliation

Published

09 04 2026

library(tidyverse)
library(modelsummary)
library(flextable)

Business Question

You work as an analyst for a hotel booking platform. Your manager asks:

“We know hotels further from the city centre tend to be cheaper. But does that penalty look the same for a budget three-star as for a luxury five-star property? And what else explains price differences once we account for location?”

Data

hotels <- read_csv("data/hotels-vienna.csv") |> filter(distance>0)

Explore the data

glimpse(hotels)

Rows: 427
Columns: 24
$ country            <chr> "Austria", "Austria", "Austria", "Austria", "Austri…
$ city_actual        <chr> "Vienna", "Vienna", "Vienna", "Vienna", "Vienna", "…
$ rating_count       <dbl> 36, 189, 53, 55, 33, 25, 57, 161, 50, NA, 203, 251,…
$ center1label       <chr> "City centre", "City centre", "City centre", "City …
$ center2label       <chr> "Donauturm", "Donauturm", "Donauturm", "Donauturm",…
$ neighbourhood      <chr> "17. Hernals", "17. Hernals", "Alsergrund", "Alserg…
$ price              <dbl> 81, 81, 85, 83, 82, 229, 103, 150, 80, 153, 60, 128…
$ city               <chr> "Vienna", "Vienna", "Vienna", "Vienna", "Vienna", "…
$ stars              <dbl> 4.0, 4.0, 4.0, 3.0, 4.0, 5.0, 4.0, 4.0, 2.0, 3.0, 4…
$ ratingta           <dbl> 4.5, 3.5, 3.5, 4.0, 3.5, 4.5, 3.5, 4.5, 3.5, NA, 4.…
$ ratingta_count     <dbl> 216, 708, 629, 52, 219, 27, 251, 617, 146, NA, 359,…
$ scarce_room        <dbl> 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, …
$ hotel_id           <dbl> 21894, 21897, 21901, 21902, 21903, 21904, 21906, 21…
$ offer              <dbl> 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, …
$ offer_cat          <chr> "15-50% offer", "1-15% offer", "15-50% offer", "15-…
$ year               <dbl> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 201…
$ month              <dbl> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11,…
$ weekend            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ holiday            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ distance           <dbl> 2.7, 1.7, 1.4, 1.7, 1.2, 0.9, 0.9, 1.0, 0.7, 1.5, 1…
$ distance_alter     <dbl> 4.4, 3.8, 2.5, 2.5, 2.8, 3.0, 2.4, 2.7, 2.7, 2.7, 2…
$ accommodation_type <chr> "Apartment", "Hotel", "Hotel", "Hotel", "Hotel", "A…
$ nnights            <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ rating             <dbl> 4.4, 3.9, 3.7, 4.0, 3.9, 4.8, 3.9, 4.6, 3.5, NA, 4.…

Task 1.1 — sample answer

Each observation is one hotel in Vienna listed on a booking platform for a specific date. The key variables for this analysis are price (nightly rate in EUR, numeric), distance (distance from the city centre in km, numeric), and stars (star rating, numeric — we will convert this to a factor later).

Task 1.2

# Build a one-row-per-variable summary data frame
desc_stats <- tibble(
  Variable    = c("Price (EUR)", "Distance from centre (km)", "Star rating"),
  Mean        = c(mean(hotels$price,    na.rm = TRUE),
                  mean(hotels$distance, na.rm = TRUE),
                  mean(hotels$stars,    na.rm = TRUE)),
  `Std. Dev.` = c(sd(hotels$price,    na.rm = TRUE),
                  sd(hotels$distance, na.rm = TRUE),
                  sd(hotels$stars,    na.rm = TRUE)),
  Min         = c(min(hotels$price,    na.rm = TRUE),
                  min(hotels$distance, na.rm = TRUE),
                  min(hotels$stars,    na.rm = TRUE)),
  Max         = c(max(hotels$price,    na.rm = TRUE),
                  max(hotels$distance, na.rm = TRUE),
                  max(hotels$stars,    na.rm = TRUE))
) |>
  mutate(across(where(is.numeric), \(x) round(x, 2)))

min_price   <- round(min(hotels$price,      na.rm = TRUE))
max_price   <- round(max(hotels$price,      na.rm = TRUE))
mean_price  <- round(mean(hotels$price,     na.rm = TRUE))
avg_stars   <- round(mean(hotels$stars,     na.rm = TRUE), 1)
median_dist <- round(median(hotels$distance, na.rm = TRUE), 1)
max_dist    <- round(max(hotels$distance,    na.rm = TRUE), 1)

flextable(desc_stats) |>
  set_table_properties(width = 0.9, layout = "autofit") |>
  bold(part = "header")

Table 1: Descriptive statistics: Vienna hotels

Variable	Mean	Std. Dev.	Min	Max
Price (EUR)	131.11	91.53	27.0	1,012
Distance from centre (km)	1.66	1.60	0.1	13
Star rating	3.43	0.77	1.0	5

Sample comment: Nightly prices range widely — from about 27 EUR to about 1012 EUR — with an average around 131 EUR. The median distance from the city centre is 1.3 km, though the furthest hotel in the sample sits 13 km away. The average star rating is 3.4.

Data Preparation

hotels_clean <- hotels |>
  filter(stars %in% c(3, 3.5, 4, 4.5, 5)) |>
  filter(!is.na(price), !is.na(distance)) |>
  mutate(
    star_cat = case_when(
      stars <= 3   ~ "3-star",
      stars <= 4   ~ "4-star",
      stars >= 4.5 ~ "5-star"
    ),
    star_cat = factor(star_cat, levels = c("3-star", "4-star", "5-star"))
  )

Task 2.1

hotels_clean |> count(star_cat) |> flextable() |> theme_vanilla()

Table 2: The distribution of star ratings.

star_cat	n
3-star	140
4-star	199
5-star	35

Sample answer: After filtering, 374` observations remain. The majority are 3- and 4-star hotels; 5-star properties are the smallest group (see Table 2).

Visualise the key relationship

Task 2.2

ggplot(hotels_clean, aes(x = distance, y = price, colour = star_cat)) +
  geom_point(alpha = 0.4, size = 1.5) +
  geom_smooth(method = "lm", se = FALSE) +
  scale_colour_manual(
    values = c("3-star" = "#69AACD", "4-star" = "#00395B", "5-star" = "#C8102E")
  ) +
  labs(
    x      = "Distance from city centre (km)",
    y      = "Price per night (EUR)",
    colour = "Star category",
    title  = "Does location matter equally for all hotel tiers?"
  ) +
  theme_minimal()

Figure 1: Hotel price vs. distance from city centre, by star category (Vienna)

From this figure we see that the relationship is not linear, but exponential, so we should use logs:

ggplot(hotels_clean, aes(x = log(distance), y = log(price), colour = star_cat)) +
  geom_point(alpha = 0.4, size = 1.5) +
  geom_smooth(method = "lm", se = FALSE) +
  scale_colour_manual(
    values = c("3-star" = "#69AACD", "4-star" = "#00395B", "5-star" = "#C8102E")
  ) +
  labs(
    x      = "Distance from city centre (log(km))",
    y      = "Price per night log((EUR))",
    colour = "Star category",
    title  = "Does location matter equally for all hotel tiers?"
  ) +
  theme_minimal()

Figure 2: Hotel price vs. distance from city centre (logs), by star category (Vienna)

Sample comment: The three regression lines are not perfectly parallel. The 5-star line appears somewhat steeper (falling more steeply with distance) than the 3-star line, with 4-star hotels having an intermediate position. This visual pattern suggests that a distance penalty may be more pronounced for luxury hotels — exactly the business question we are investigating. The interaction model below will quantify this and help us to test statistically whether the differences in the slope are statistically significant.

Regression Models

Estimating the models

All models use log-transformed price and distance, motivated by the non-linear pattern visible in the scatter plot. In a log-log model, coefficients are interpreted as elasticities: a 1% increase in distance is associated with a β% change in price.

m1 <- lm(log(price) ~ log(distance), data = hotels_clean)

m2 <- lm(log(price) ~ log(distance) + stars, data = hotels_clean)

m3 <- lm(log(price) ~ log(distance) + star_cat, data = hotels_clean)

m4 <- lm(log(price) ~ log(distance) * star_cat, data = hotels_clean)

Model Comparison

coef_labels <- c(
  "(Intercept)"                    = "Constant",
  "log(distance)"                  = "log(Distance from centre)",
  "stars"                          = "Star rating (continuous)",
  "star_cat4-star"                 = "4-star hotel",
  "star_cat5-star"                 = "5-star hotel",
  "log(distance):star_cat4-star"   = "log(Distance) × 4-star",
  "log(distance):star_cat5-star"   = "log(Distance) × 5-star"
)

modelsummary(
  list(
    "(1) Distance"        = m1,
    "(2) Stars (numeric)" = m2,
    "(3) Stars (factor)"  = m3,
    "(4) Interaction"     = m4
  ),
  coef_map = coef_labels,
  gof_map  = c("nobs", "r.squared", "adj.r.squared"),
  stars    = TRUE,
  notes    = "Hotels-Vienna dataset. Békés & Kézdi (2021). Hotels with 3–5 stars. Outcome: log(price). Distance enters as log(distance)."
)

Table 3: Drivers of hotel prices in Vienna: regression results (log-log)

	(1) Distance	(2) Stars (numeric)	(3) Stars (factor)	(4) Interaction
+ p < 0.1, * p < 0.05, p < 0.01, * p < 0.001
Hotels-Vienna dataset. Békés & Kézdi (2021). Hotels with 3–5 stars. Outcome: log(price). Distance enters as log(distance).
Constant	4.785***	3.631***	4.619***	4.590***
	(0.022)	(0.126)	(0.032)	(0.034)
log(Distance from centre)	-0.206***	-0.153***	-0.151***	-0.075*
	(0.022)	(0.021)	(0.020)	(0.036)
Star rating (continuous)		0.317***
		(0.034)
4-star hotel			0.171***	0.199***
			(0.042)	(0.042)
5-star hotel			0.766***	0.717***
			(0.072)	(0.082)
log(Distance) × 4-star				-0.099*
				(0.044)
log(Distance) × 5-star				-0.217**
				(0.083)
Num.Obs.	374	374	374	374
R2	0.190	0.342	0.380	0.394
R2 Adj.	0.188	0.339	0.375	0.386

Interpretation

Task 3.1 — sample answer

Column (1) in Table 3 shows the baseline model. The log(distance) coefficient is the price elasticity of distance: a 1% increase in distance from the centre is associated with approximately β% lower price. The log-log specification is motivated by the scatter plot, which showed that the price penalty flattens out at larger distances rather than declining at a constant euro rate.

Task 3.2 — sample answer

In column (2), the stars coefficient gives the approximate percentage price premium per additional star (exact: (exp(β) − 1) × 100%). Crucially, this model constrains the price step from 3- to 4-star to be exactly the same as the step from 4- to 5-star. Comparing the 4-star and 5-star coefficients in column (3) reveals whether that constraint is warranted.

Task 3.3 — sample answer

The reference category is 3-star — all star coefficients in columns (3) and (4) are measured relative to 3-star hotels at the same distance.
The 4-star hotel coefficient in column (3) is the difference in log price between a 4-star and a 3-star hotel. If the coefficient is 0.35, a 4-star hotel charges approximately (exp(0.35) − 1) × 100 ≈ 42% more than a comparable 3-star property.
Model (3) lets the 3→4-star and 4→5-star steps differ freely. If the 5-star premium is proportionally much larger than the 3-to-4-star step — as the coefficients in column (3) typically show — Model (2) would have misrepresented both steps by forcing them equal.

Task 3.4 — sample answer

log(Distance) × 4-star in column (4) is the difference in distance elasticity between 4-star and 3-star hotels. If the base elasticity is −0.12 and the interaction is −0.08, the total elasticity for 4-star hotels is −0.20: a 10% increase in distance reduces a 4-star price by about 2%, versus about 1.2% for a 3-star hotel.
The total distance elasticity for a 5-star hotel = log(Distance from centre) + log(Distance) × 5-star from column (4). Sum those two values.
The log-scale scatter plot showed visibly steeper slopes for higher-star tiers, and the negative interaction coefficients in column (4) confirm this quantitatively.

Task 4.1 — sample answer

R² increases from column (1) to (4), as adding variables always raises it. Adjusted R² penalises complexity: if it still rises from (3) to (4), the interaction terms are genuinely improving fit rather than just absorbing noise.
The log(distance) coefficient shifts between columns (1), (3), and (4). In column (1) it conflates the direct distance effect with the fact that cheaper hotels tend to be further out; controlling for star tier in (3) isolates the pure location penalty. The further shift in (4) reflects that once slopes differ by tier, the average distance effect is estimated differently again.

Communication

Task 5 — sample paragraph

In Vienna, hotel prices decline with distance from the city centre, but not at the same rate for all hotels. Budget three-star properties are relatively insensitive to location — moving a few kilometres further out barely affects their price. Luxury five-star hotels, by contrast, face a steeper penalty: their prices drop more sharply the further they sit from the centre, suggesting that guests paying premium rates place a higher value on central access. Star tier itself is the strongest driver of price: a five-star hotel commands a substantially higher rate than a comparable three-star property regardless of location. For a luxury operator, a central location is therefore not just a marketing asset — it is a direct contributor to pricing power.