A geographer’s introduction to space-time regression with GAMs using stgam

Lex Comber, Paul Harris and Chrs Brunsdon

April 2025

Overview

The stgam package (Comber, Harris, and Brunsdon 2024) provides a wrapper around mgcv functionality (S. Wood 2021) to support space-time analysis, with a focus on both prediction but also importantly on inference (or process understanding). The latter is commonly the objective of geographical analysis, which are often concerned with how processes and statistical relationships vary over space and / or time.

This vignette illustrates a workflow for analyses to quantify and model variations over space and time based on GAM smooths. The case study examines house price (actually price per unit area) for terraced houses in London 2018 to 2024 and highlights the importance of investigating the data for spatial and / or temporal variability before constructing space-time GAMs and undertaking the stgam workflow.

You should load the stgam package and the data. The data are sample of terraced house sales data for 2018 to 2024 extracted from Bin et al. (2024), and located via the ONS lookup tables that allows UK postcodes to be linked to location (actually the geometric centroids of the postcode area).

library(cols4all)
library(dplyr)
library(ggplot2)
library(tidyr)
library(sf)
library(cowplot)
# load the package and data
library(stgam)
data("hp_data")
data("lb")

You should examine the help for the datasets and especially the variables that hp_data contains:

help(hp_data)
help(lb)

1. Data considerations

The first stage is simply examine the data, particularly the continuous variables over space and time. The code below examines the data and undertakes some initial investigations.

# examine what is loaded 
hp_data
#> # A tibble: 1,888 × 13
#>    price priceper   tfa dot          yot  beds type    cef   pef ageb      lad            X      Y
#>    <dbl>    <dbl> <dbl> <date>     <dbl> <int> <chr> <int> <int> <chr>     <chr>      <int>  <int>
#>  1 1000     7463. 134   2023-02-22  2023     9 T        60    82 1930-1949 E09000014 531674 188707
#>  2  520     7879.  66   2022-05-27  2022     3 T        57    74 1930-1949 E09000031 536888 189269
#>  3  325     5078.  64   2021-01-15  2021     4 T        50    82 1950-1966 E09000016 549763 190717
#>  4  300     2400  125   2020-01-17  2020     5 T        64    85 1900-1929 E09000031 538646 186513
#>  5  585     7222.  81   2020-11-13  2020     4 T        75    88 1983-1990 E09000027 514710 172535
#>  6  618.    6786.  91   2023-05-18  2023     4 T        61    84 1950-1966 E09000008 533233 170230
#>  7 1183.   12117.  97.6 2022-04-22  2022     5 T        63    66 1900-1929 E09000027 516471 174772
#>  8  800     6154. 130   2023-09-28  2023     7 T        60    80 1900-1929 E09000010 532032 194831
#>  9  560     6022.  93   2022-12-21  2022     5 T        47    79 1930-1949 E09000029 526674 164298
#> 10  300     3750   80   2019-09-16  2019     4 T        71    86 1950-1966 E09000004 548369 179349
#> # ℹ 1,878 more rows

The print out of the first 10 records in hp_data indicates that we have a time variable (dot) in date format as well as location in metres (X and Y). The analyses will model price per unit area (the priceper variable in hp_data). This is in units of pounds per spare metre and GAM smooth models will be constructed that regress this against other variables in the data including location and time.

However, it often more useful to have a continuous scalar value for the time variable, in this case the date of sale transfer (dot), and to have location in kilometres rather than metres. The code below uses the earliest date in the dataset to create a variable called days to represent time (here 100s of days since the earliest date) and rescales the locational data, but retaining the original coordinates for mapping:

hp_data <- 
  hp_data |> 
  # create continuous time variable
  mutate(days = as.numeric(dot - min(dot))/100) |>
  relocate(days, .after = dot) |>
  # scale location and retain original coordinates
  mutate(Xo = X, Yo = Y) |>
  mutate(X = X/1000, Y = Y/1000)

Again this can be examined:

hp_data

The data and the target variable can be mapped as in the code snippet below. Note the use of the Xo and Yo coordinates in the mapping with ggplot.

# map the data layers
lb |> 
  ggplot() + geom_sf() +
  geom_point(data = hp_data |> st_drop_geometry(), aes(x = Xo, y = Yo, col = priceper)) +
  scale_color_viridis_c(option = "magma") +
  theme_bw()  +xlab("") + ylab("")
The spatial variation of the priceper variable (house price per square metre in £s).
The spatial variation of the priceper variable (house price per square metre in £s).

The map above shows the spatial distribution of priceper and also indicates that there were no terraced house sales in the centre of London (in the City of London local authority district) in this period. In many of the analyses below the lb spatial dataset of London boroughs will be used to provide spatial context to the results. To do this the coordinates fo the spatial data need to be rescaled in the same way the X and Y variables in hp_data were rescaled to aid model construction. The code below creates a rescaled version of the lb projected in kilometres (remember the original data can always be reloaded using data(lb)).

#transform to km
lb <- st_transform(lb, pipeline = "+proj=pipeline +step +proj=unitconvert +xy_out=km")
# remove the projection to avoid confusing ggplot
st_crs(lb) <- NA

The correlations and distributions of the numeric variables can be examined. This is important in order to establish that any regression of the proposed target variable against the predictor variables may yield something interesting and sensible, and just as importantly to identify any variables that might be a problem when trying to make a model. For example, in this case it is good that the proposed target variable (priceper) has a healthy tail.

The code below generates boxplots and then histograms.

# boxplots
hp_data |> 
  select(lad, price, priceper, tfa, days, beds, cef, pef, X, Y) |>
  pivot_longer(-lad) |>
  ggplot(aes(x = value), fil) + 
  geom_boxplot(fill="dodgerblue") +
  facet_wrap(~name, scales = "free") +
  theme_bw()
Boxplots of the numeric variables in the data.
Boxplots of the numeric variables in the data.
# histograms
hp_data |> 
  select(lad, price, priceper, tfa, days, beds, cef, pef, X, Y) |>
  pivot_longer(-lad) |>
  ggplot(aes(x = value), fil) + 
  geom_histogram(aes(y=after_stat(density)),bins = 30, 
                 fill="tomato", col="white") +
  geom_density(alpha=.5, fill="#FF6666") +
  facet_wrap(~name, scales = "free") +
  theme_bw()
Histograms of the numeric variables in the data.
Histograms of the numeric variables in the data.

Examining correlations is for similar reasons: to check for issues that may be a problem later on in the analysis workflow. Here the aim is to establish that there are reasonable correlations with the target variable and lack of collinearity amongst the predictor variables, when they are examined without considering time or space (i.e. they are examined globally).

# correlations
hp_data |> 
  select(priceper, price, tfa,  beds, cef, pef) |>
  cor() |> round(3)
#>          priceper  price    tfa   beds    cef    pef
#> priceper    1.000  0.723  0.308  0.179 -0.118 -0.220
#> price       0.723  1.000  0.791  0.549 -0.082 -0.216
#> tfa         0.308  0.791  1.000  0.780 -0.034 -0.202
#> beds        0.179  0.549  0.780  1.000 -0.079 -0.170
#> cef        -0.118 -0.082 -0.034 -0.079  1.000  0.388
#> pef        -0.220 -0.216 -0.202 -0.170  0.388  1.000

Finally, thinking about space-time analysis of your dataset, as a general heuristic, any dataset for use in space-time analysis should have a minimum of about 100 locations and a minimum of 50 observation time periods. You should consider this when you are assembling your data.

2. Detecting variability

The investigations in the previous section were all concerned with establishing the viability of undertaking the proposed analysis, using standard exploratory data analysis approaches to understand distributions and correlations. In this section these are extended to test for the presence of variability in the data but this now over time, space and space-time. This is done by constructing a series of regression models, of different forms and with different parameters. Tools for constructing Generalized Additive Models (GAMs) using the mgcv package (S. Wood 2021) are introduced but with particular focus on examining using GAM smooths or splines to explore variations over space and time.

2.1 Initial investigations with OLS and dummy variables

As an initial baseline model, the code below creates a standard OLS regression model of priceper using the lm function. The model fit is weak, but the model summary indicates that the some of the variables (pef and beds) are significant predictors of the target variable. The analysis of variance (ANOVA), calculated using the anova function, shows how much variance in priceper is explained by each predictor and whether that contribution is statistically significant. Here higher values in the F value test statistic provides stronger evidence the predictor is useful.

So there is evidence that all three predictors (cef, pef, beds) are statistically significant and the F-values and associated p-values (Pr(>F)) show strong evidence that each variable contributes meaningfully to explaining variation in priceper. The model likely has decent explanatory power, assumptions of no major issues like multicollinearity.

# an OLS model
m_ols <- lm(priceper ~ cef + pef + beds, data = hp_data)
summary(m_ols)
#> 
#> Call:
#> lm(formula = priceper ~ cef + pef + beds, data = hp_data)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -7057.6 -1752.2  -554.9  1220.4 23271.9 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 10751.489    690.879  15.562  < 2e-16 ***
#> cef            -9.633      6.325  -1.523    0.128    
#> pef           -59.626      8.047  -7.410 1.90e-13 ***
#> beds          298.329     46.327   6.440 1.52e-10 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 2740 on 1884 degrees of freedom
#> Multiple R-squared:  0.06997,    Adjusted R-squared:  0.06849 
#> F-statistic: 47.25 on 3 and 1884 DF,  p-value: < 2.2e-16
anova(m_ols)
#> Analysis of Variance Table
#> 
#> Response: priceper
#>             Df     Sum Sq   Mean Sq F value    Pr(>F)    
#> cef          1 2.1265e+08 212648515  28.323 1.149e-07 ***
#> pef          1 5.4018e+08 540183718  71.948 < 2.2e-16 ***
#> beds         1 3.1134e+08 311344258  41.468 1.516e-10 ***
#> Residuals 1884 1.4145e+10   7508013                      
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

A final and quick trick to confirm the presence of space and time effects is to use them as a dummy variable with one of the variables. Here this is done with tfa (total floor area) that unsurprisingly is highly correlated with the target variable (priceper):

# Dummy with LADs
m_dummy1 <- lm(priceper ~ tfa:lad, data = hp_data)
summary(m_dummy1)
anova(m_dummy1)

This shows that there are important and significant differences between the first London borough (which has a lad value equal to “E09000001”) that is not shown in the model summary and many of the the rest that are, suggesting some important spatial differences in the interaction of floor area and price in different boroughs. This is confirmed by the F values in the ANOVA.

# Dummy with Time
m_dummy2 <- lm(priceper ~ tfa:days, data = hp_data)
summary(m_dummy2)
anova(m_dummy2)

Similarly, an even larger effect with time is demonstrated when days is used as the dummy variable.

Thus, despite the models being globally weak with low \(R^2\) values, there is evidence of spatial and temporal interactions with the target variable that warrant further more formal exploration with respect to potential temporal and spatial trends. GAMs with smooths offer a route to investigate these.

2.2 Detecting variability using GAM smooths

GAMs with splines (or smooths - the terms are used interchangeably) can be used to explore and quantify spatial and temporal variations. However before considering spatial and temporal aspects, it is important to outline what splines do.

Splines help to fit a (smooth) curve through a cloud of messy data points — like drawing a line that follows a winding mountain road. They generate flexible (bendy) fits (curves) that are able to adapt to the shape of your data, and thus describe the relationship between variables better than a straight line. For example, consider the scatterplot of some simulated data below: it would be difficult to fit a straight line through it.

# create simulated data 
set.seed(12)
x  <- runif(500)
mu <- sin(2 * (4 * x - 2)) + 2 * exp(-(16 ^ 2) * ((x - .5) ^ 2))
y  <- rnorm(500, mu, .3)
# plot x and y
ggplot() + 
    geom_point(aes(x,y)) +
  theme_bw()
Simulated x and y data.
Simulated x and y data.

The code below uses a GAM spline to determine the relationship between x and y without having to pre-specify any particular form (e.g. quadratic, exponential etc). The bendiness or wiggliness of the fit is determined automatically by the s function in different ways depending on the type of smooth (via penalization and basis functions for thin plate smooths, the default and in smooths created using s in the mgcv package). Effectively what this does is split the data into chunks and fit piecewise polynomial relationships (rather than fitting a single straight line).

# a GAM illustration with a spline using s
gam_s_example <- gam(y~s(x))
# extract the smooth fit
y.s <- gam_s_example$fitted.values
# plot
ggplot() +
  geom_point(aes(x,y), col = "grey") +
    geom_line(aes(x, y = y.s), lwd = 1) +
  theme_bw()
Simulated x and y data with GAM a Spline fitted.
Simulated x and y data with GAM a Spline fitted.

The purpose of this diversion into splines and smooths is to illustrate how they model different (slopes) relationships with y in different locations along the x feature space. In this way spline smooth curves can be used to capture non-linear relationships in attribute space (here the relationship of x with y). Importantly, in the context of space-time varying coefficient modelling with stgam, the spline can be used to model how relationships between x and y varies with respect to time or location if the smooth is also parameterised with those.

2.3 Temporal variability

It would be useful to examine the how the target variable, priceper, varies over time. One way of doing this is to extend the use GAM smooths illustrated above. The code snippet below seek to model the variation in priceper but this time using time (the days variable).

# the first GAM 
gam.1 <- gam(priceper~s(days), data = hp_data)
summary(gam.1)
#> 
#> Family: gaussian 
#> Link function: identity 
#> 
#> Formula:
#> priceper ~ s(days)
#> 
#> Parametric coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  6832.48      64.11   106.6   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Approximate significance of smooth terms:
#>           edf Ref.df     F p-value    
#> s(days) 3.798  4.708 15.76  <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> R-sq.(adj) =  0.0374   Deviance explained = 3.93%
#> GCV = 7.7786e+06  Scale est. = 7.7588e+06  n = 1888

The smooth can be investigated by plotting it with time. Note the use of the actual data variable dot rather than days in the code below - this is simply to have a friendly x-axis on the plot! - and the use of the predict function to extract the standard errors:

# create a data frame with x, predicted y, standard error
x <- hp_data$dot
y <- gam.1$fitted.values
se <- predict(gam.1, se = TRUE, hp_data)$se.fit
u <- y+se
l <- y-se
df <- data.frame(x, y, u, l)
# plot!
ggplot(df, aes(x, y, ymin = l, ymax = u)) + 
  geom_ribbon(fill = "lightblue") + 
  geom_line() + 
  theme_bw() + 
  xlab("Date") + ylab("priceper") 
A plot of the temporal smooth.
A plot of the temporal smooth.

This shows variation in the modelled relationship of priceper with time. The code above extracted the predicted values (\(\hat{y}\)) of priceper from the model and plotted them against time, but using the actual date not the continuous variable. The plot shows how the relationship of the target variable varies with time and potentially reflecting the effects over the pandemic, with increases in the value of homes with there own gardens and outdoor spaces (even small ones!) rather than apartments or flats. This provides confirmation of the potential suitability of a temporal analysis of priceper.

2.4 Spatial variability

The spatially varying trends can be examined in similar way, again using a standard s spline with location.

# the second GAM
gam.2 <- gam(priceper~s(X,Y), data = hp_data)
summary(gam.2)
#> 
#> Family: gaussian 
#> Link function: identity 
#> 
#> Formula:
#> priceper ~ s(X, Y)
#> 
#> Parametric coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  6832.48      43.62   156.6   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Approximate significance of smooth terms:
#>          edf Ref.df     F p-value    
#> s(X,Y) 27.47  28.85 81.52  <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> R-sq.(adj) =  0.554   Deviance explained = 56.1%
#> GCV = 3.6469e+06  Scale est. = 3.5919e+06  n = 1888
plot(gam.2, asp = 1)
A mgcv plot of the spatial smooth.
A mgcv plot of the spatial smooth.

Notice that the model fit (\(R^2\)) has increased substantially from the previous models. This suggest that this variable does vary over space. Again the smooth can be investigated by plotting, but this time rather than a line it is 2-Dimensional surface. However, it would be useful to to visualise this as a surface rather than just over the hp_data point locations. The code below uses the predict function used to model the relationship over a hexagonal grid covering the extent of the London boroughs, with values for X and Y of the centroid of hexagonal cell, and plots the surface over geographic space:

# 1. create a grid object the study area from the LB  data
l_grid <- 
  st_make_grid(lb, square=FALSE,n=50) |> 
  st_sf() |> 
  st_join(lb) |> 
  filter(!is.na(name))
# rename the geometry, sort row names 
st_geometry(l_grid) = "geometry"
rownames(l_grid) <- 1:nrow(l_grid)
# create and add coordinates X and Y
coords <- l_grid |> st_centroid() |> st_coordinates() 
#> Warning: st_centroid assumes attributes are constant over geometries
l_grid <- l_grid |> bind_cols(coords) 

You may wish to inspect this object:

l_grid
plot(st_geometry(l_grid))

Before continuing with the mapping procedure:

# 2. predict over the grid
yhat <- predict(gam.2, newdata = l_grid)
l_grid |> mutate(yhat = yhat) |>
  # 4.and plot
  ggplot() + 
  geom_sf(aes(fill = yhat), col = NA) +
  # adjust default shading
  scale_fill_continuous_c4a_seq("brewer.yl_or_rd", name = "priceper") +
  # add context
  geom_sf(data = lb, fill = NA) +
  # apply and modify plot theme
  theme_bw() +
  theme(legend.position = "bottom",
        legend.key.width = unit(1, "cm"))
A map of the smoothed (predicted) response variable over hexagon grid.
A map of the smoothed (predicted) response variable over hexagon grid.

This models the spatial trends in the variations over space of the target variable. The results indicate similar high relationships with space in the centre of the study area as in the initial map of the data earlier, and again provides confirmatory evidence that the some spatially explicit modelling of priceper is appropriate in this case.

2.5 Spatial and temporal variability I

The spline can be extends to model the target variable in space-time by including the location and time variables in the smooth. However, a slight different set of considerations are required because we are mixing space and time in the mgcv smooths. Consider the code snippet below for the extension into space-time smooths: it specifies a smooth using the t2 function rather than s and contains other parameters in the d argument.

# the third GAM
gam.3 <- gam(priceper~t2(X,Y,days, d = c(2,1)), data = hp_data)
summary(gam.3)
#> 
#> Family: gaussian 
#> Link function: identity 
#> 
#> Formula:
#> priceper ~ t2(X, Y, days, d = c(2, 1))
#> 
#> Parametric coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  6832.48      42.57   160.5   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Approximate significance of smooth terms:
#>                edf Ref.df     F p-value    
#> t2(X,Y,days) 47.87  52.07 37.39  <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> R-sq.(adj) =  0.576   Deviance explained = 58.6%
#> GCV = 3.5118e+06  Scale est. = 3.4209e+06  n = 1888

It is important to consider the difference between s(X,Y,days) (which would a continuation of the format above but has not been run) and t2(X,Y,days,d=c(2,1)) before examining the resultant GAM (gam.3). The spline functions s() and t2() can model smooth interactions between multiple variables, they handle dimension scaling, marginal smoothness, and basis construction in different ways.

Thus in this case the TP smooth is specified because space and time are combined. The code below examines the results and uses the mgcv plot function to show the spatial distributions over 9 discretised time chunks (~310 days).

plot(gam.3, asp = 1)

However it is also possible to slice and dice the time variable in other ways. The code snippet below illustrates how this can be done for 365 day periods using the calculate_vcs function in the stgam package. If predictor variables are included in the terms parameter then the function coefficient estimates (“b_”) and standard errors (“se_”) for each of terms specified. However, in this case we are just interested in the predicted value of the target variable, yhat. The code below uses the grid object created above (l_grid) as a spatial framework to hold the prediction of the priceper variable over these discrete time periods.

# 1. create time intervals (see the creation of days variable above)
pred_days = seq(0, 2190, 365)/100
# 2. create coefficient estimates for each time period (n = 7) 
res_out <- NULL
for (i in pred_days){
  res.i <- calculate_vcs(input_data = l_grid |> mutate(days = i),
                         mgcv_model = gam.3,
                         terms = NULL)
  res_out <- cbind(res_out, res.i$yhat)
}
# 3. name with years and join to the grid
colnames(res_out) <- paste0("Y_", 2018:2024)

l_grid |> cbind(res_out) |>
  # select the variables and pivot longer
  select(-name, -lad, -X, -Y) |>
  pivot_longer(-geometry) |>
  # make the new days object a factor (to enforce plotting order)
  mutate(name = factor(name, levels = paste0("Y_", 2018:2024))) |>
  
  # 4. and plot 
  ggplot() + 
  geom_sf(aes(fill = value), col = NA) +
  # adjust default shading
  scale_fill_continuous_c4a_seq("brewer.yl_or_rd", name = "Predicted \n'priceper'") +
  # facet
  facet_wrap(~name, ncol = 3) +
  # apply and modify plot theme
  theme_bw() + 
  theme(
    legend.position = "inside",
    legend.direction = "horizontal",
    legend.position.inside = c(0.7, 0.15), 
      legend.text=element_text(size=10),
      legend.title=element_text(size=12),
      strip.background = element_rect(fill="white", colour = "white"), 
      strip.text = element_text(size = 8, margin = margin(b=4)),
      legend.key.width = unit(1.5, "cm"),
        axis.title=element_blank(),
        axis.text=element_blank(),
    axis.ticks=element_blank())   
A plot of a spatial smooth over 7 approximately annual time periods.
A plot of a spatial smooth over 7 approximately annual time periods.

What this time series set of maps show is that without any exploratory variables, the trends in priceper vary spatially and change in intensity over time (the increase and decrease in values over space), with limited variation in their spatial pattern.

2.6 Spatial and temporal variability II

The formula used to construct the gam.3 model in the previous subsection constructed a smooth that implicitly combined space and time, under an assumption that the variations in space and time of priceper interact with each other. However this may not be the case. More generally spatial and temporal dependencies might not be expected to interact in data for relatively small region, whose observations might be expected to be subject to the same socio-economic or environmental pressures. Instead, any space and time effects might be expected to be independent of each other. Whereas in a national study any space and time effects might be expected to interact more strongly.

It is possible to use a different construction involving separate smooths for space and time, with the s() splines, and to avoid the assumption of the interaction of space and time effects:

# the fourth GAM
gam.4 <- gam(priceper~s(X,Y) + s(days), data = hp_data)
summary(gam.4)
#> 
#> Family: gaussian 
#> Link function: identity 
#> 
#> Formula:
#> priceper ~ s(X, Y) + s(days)
#> 
#> Parametric coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  6832.48      42.45     161   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Approximate significance of smooth terms:
#>            edf Ref.df     F p-value    
#> s(X,Y)  27.509 28.861 83.76  <2e-16 ***
#> s(days)  3.536  4.392 23.88  <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> R-sq.(adj) =  0.578   Deviance explained = 58.5%
#> GCV = 3.4606e+06  Scale est. = 3.4019e+06  n = 1888

Again there is evidence of spatial and temporal trends in the data and these can be visualised using the plot.gam function in mgcv (called with just plot).

plot(gam.4, page = 1)
A mgcv plot of the spatial and temporal smooths.
A mgcv plot of the spatial and temporal smooths.

Also, the space-time interactions could be generated in the same way as for the gam.3 model above, using the X and Y values in l_grid and the time slices in pred_days.

2.7 Including other predictor variables

Up until now only the the response (target) variable, priceper has been examined against time and location in GAM smooth specified in different ways, with the aim of exploring the spatial and temporal variability of the variable, but also to introduce GAM smooths. The code below includes the pef predictor variable in a GAM model as a fixed parametric term (i.e. in the same way as in the OLS model created above).

# the fifth GAM
gam.5 <- gam(priceper~s(X,Y) + s(days) + pef, data = hp_data)
summary(gam.5)
#> 
#> Family: gaussian 
#> Link function: identity 
#> 
#> Formula:
#> priceper ~ s(X, Y) + s(days) + pef
#> 
#> Parametric coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 8826.648    408.776  21.593  < 2e-16 ***
#> pef          -24.802      5.057  -4.905 1.02e-06 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Approximate significance of smooth terms:
#>            edf Ref.df     F p-value    
#> s(X,Y)  27.507 28.861 77.85  <2e-16 ***
#> s(days)  3.473  4.316 24.89  <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> R-sq.(adj) =  0.583   Deviance explained =   59%
#> GCV = 3.4216e+06  Scale est. = 3.3618e+06  n = 1888

The model summary indicates that pef is a significant predictor of priceper and improves the model fit. However, this raises a key question of how should predictor variables be included in in the model: in parametric form or in a smooth? This is addressed in the next section.

As before the smooths can be plotted:

plot(gam.5, page = 1) 

2.8 Summary

This section has introduced GAMs with splines / smooths as a way of investigating any space-time effects present in the data and their characteristics. The mechanics of splines was described and illustrated, and a brief introduction to different spline forms was provided. Variations in the response variable (priceper) in time, space and space-time were explored through elementary model fitting and plotting the smooth graphical trends. In this case, effects in space and time are present in the priceper predictor variable and it looks like there is a universal spatial trend at all times and universal temporal trend everywhere. The final GAM model included explanatory variables (pef and beds) which added some explanatory power but these were not examined with respect to space or time (i.e. in smooths). This is done in the next section.

Such investigations are an important initial step. They guide subsequent analysis rather than making assumptions about space time interactions and plugging every predictor variable into a space-time smooth of some kind. They provide evidence of space-time variations in the response variable - i.e. whether the effects in space and time are present in the data - and can help determine which predictor variables may be of further interest.

3. Effects of Time and Space with predictor variables

The previous section examined variability fo the response variable in time, space and space-time (the latter in 2 different ways). In this section these investigations are extended to consider the variation of the response variable (priceper) with the pef predictor variable, but this time with the Intercept as an addressable term in the model.

Consider the gam.5 model created above:

gam.5  <- gam(priceper~s(X,Y) + s(days) + pef, data = hp_data)
summary(gam.5)

Note that the gam.5 model includes the terms s(X,Y) + s(days). These model the spatially varying and temporally varying Intercept plus error. Note that if this was a spatial model (without time), then just s(X,Y) would be included for the Intercept and similarly just s(days) for a purely temporal model.

It gam.5 model can be reformulated as follows with an addressable Intercept term:

gam.5.new <- gam(priceper ~0 + Intercept + s(X,Y,by=Intercept) + s(days, by=Intercept) + pef,
                data = hp_data |> mutate(Intercept = 1))
summary(gam.5)
#> 
#> Family: gaussian 
#> Link function: identity 
#> 
#> Formula:
#> priceper ~ s(X, Y) + s(days) + pef
#> 
#> Parametric coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 8826.648    408.776  21.593  < 2e-16 ***
#> pef          -24.802      5.057  -4.905 1.02e-06 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Approximate significance of smooth terms:
#>            edf Ref.df     F p-value    
#> s(X,Y)  27.507 28.861 77.85  <2e-16 ***
#> s(days)  3.473  4.316 24.89  <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> R-sq.(adj) =  0.583   Deviance explained =   59%
#> GCV = 3.4216e+06  Scale est. = 3.3618e+06  n = 1888

The model summary now includes Intercept as a parametric term (rather than (Intercept)) and s(X,Y):Intercept and s(days):Intercept as smooth terms rather than s(X,Y)and s(days), but the values in both summaries are all the same.

The formula in subsequent models in this vignette include the Intercept as an addressable term. The code below creates this variable in the input data and in order to facilitate mapping of the resulting coefficient estimates, the models in this section are specified with the original coordinates:

hp_data <- 
  hp_data |> 
  mutate(Intercept = 1)

3.1 Time

The code below specifies a GAM regression model with smooths for space and time as in the previous section but now with the pef predictor variable, included in parametric form and in a temporal smooth with days. This suggests in this case that it is important to model this relationship over time.

gam.t <- gam(priceper~0 + Intercept + s(X,Y,by=Intercept) + s(days, by=Intercept) + 
              pef + s(days, by = pef), 
            data = hp_data)
summary(gam.t)
#> 
#> Family: gaussian 
#> Link function: identity 
#> 
#> Formula:
#> priceper ~ 0 + Intercept + s(X, Y, by = Intercept) + s(days, 
#>     by = Intercept) + pef + s(days, by = pef)
#> 
#> Parametric coefficients:
#>           Estimate Std. Error t value Pr(>|t|)    
#> Intercept 8738.828    410.218  21.303  < 2e-16 ***
#> pef        -11.846      2.538  -4.668 3.27e-06 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Approximate significance of smooth terms:
#>                      edf Ref.df      F p-value    
#> s(X,Y):Intercept  27.526 28.864 78.064 < 2e-16 ***
#> s(days):Intercept  3.507  4.358  4.389 0.00119 ** 
#> s(days):pef        1.500  1.500  6.255 0.00294 ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Rank: 49/50
#> R-sq.(adj) =  0.584   Deviance explained =   94%
#> GCV = 3.4159e+06  Scale est. = 3.3544e+06  n = 1888

The summary of this model indicates that the relationship of pef with priceper changes over time and has a strong linear negative trend (the smooth s(days):pef is significant as is the pef parametric coefficient). The nature of the temporally varying coefficients can be examined using the caculate_vcs function in the stgam package. This returns a tibble with coefficient estimates (“b_”) and standard errors (“se_”) for each of terms specified.

vcs <- calculate_vcs(input_data = hp_data, mgcv_model = gam.t, terms = c("Intercept", "pef"))
head(vcs)
#> # A tibble: 6 × 22
#>   price priceper   tfa dot         days   yot  beds type    cef   pef ageb     lad       X     Y     Xo     Yo Intercept
#>   <dbl>    <dbl> <dbl> <date>     <dbl> <dbl> <int> <chr> <int> <int> <chr>    <chr> <dbl> <dbl>  <int>  <int>     <dbl>
#> 1 1000     7463.   134 2023-02-22 18.8   2023     9 T        60    82 1930-19… E090…  532.  189. 531674 188707         1
#> 2  520     7879.    66 2022-05-27 16.0   2022     3 T        57    74 1930-19… E090…  537.  189. 536888 189269         1
#> 3  325     5078.    64 2021-01-15 11.1   2021     4 T        50    82 1950-19… E090…  550.  191. 549763 190717         1
#> 4  300     2400    125 2020-01-17  7.44  2020     5 T        64    85 1900-19… E090…  539.  187. 538646 186513         1
#> 5  585     7222.    81 2020-11-13 10.4   2020     4 T        75    88 1983-19… E090…  515.  173. 514710 172535         1
#> 6  618.    6786.    91 2023-05-18 19.6   2023     4 T        61    84 1950-19… E090…  533.  170. 533233 170230         1
#> # ℹ 5 more variables: b_Intercept <dbl>, se_Intercept <dbl>, b_pef <dbl>, se_pef <dbl>, yhat <dbl[1d]>

The temporal variation of the relationship of pef with the target variable can be plotted, and in this case the linear trend is confirmed: the negative relationship of pef with the target variable does not change over time.

vcs |> 
  mutate(u = b_pef + se_pef, 
         l = b_pef - se_pef) |>
  ggplot(aes(x = dot, y = b_pef, ymin = l, ymax = u)) + 
  geom_ribbon(fill = "lightblue") + 
  geom_line() + 
  theme_bw() + 
  xlab("Date") + ylab("pef") 
A plot of the temporal smooth for pef.
A plot of the temporal smooth for pef.

3.2 Space

The spatial variation in the relationship of the predictor variable with the target variable can also be examined. The code below specifies the pef predictor variable within a spatial smooth, but again with a spatially and temporally varying intercept. Notice how pef is included in both the smooth and as parametric (global) term.

gam.s <- gam(priceper~0 + Intercept + s(X,Y,by=Intercept) + s(days, by=Intercept) + 
              pef + s(X,Y, by = pef), 
            data = hp_data)
summary(gam.s)
#> 
#> Family: gaussian 
#> Link function: identity 
#> 
#> Formula:
#> priceper ~ 0 + Intercept + s(X, Y, by = Intercept) + s(days, 
#>     by = Intercept) + pef + s(X, Y, by = pef)
#> 
#> Parametric coefficients:
#>           Estimate Std. Error t value Pr(>|t|)    
#> Intercept   8229.6      442.6  18.593   <2e-16 ***
#> pef         -111.8      168.1  -0.665    0.506    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Approximate significance of smooth terms:
#>                      edf Ref.df      F p-value    
#> s(X,Y):Intercept  29.000 29.000  5.742 < 2e-16 ***
#> s(days):Intercept  3.536  4.391 24.374 < 2e-16 ***
#> s(X,Y):pef        21.034 25.628  2.067 0.00122 ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Rank: 69/70
#> R-sq.(adj) =  0.594   Deviance explained = 94.2%
#> GCV = 3.3672e+06  Scale est. = 3.269e+06  n = 1888

Here is is evident at pef is varying significantly over space (see s(X,Y):pef) but is not significant as a global fixed term. That is, the coefficient estimate slope as a global slope is not significantly different from zero, but is when varying over space and allowing for Intercept to vary over space and time. This finding could be related to the age of the houses, which were built in clusters in different locations.

Again the varying coefficients can be extracted from the model for the predictor variable and mapped both at observation locations and over the hexagonal grid if a time period for the Intercept smooth in gam.s is specified:

# 1.over observation locations
vcs <- calculate_vcs(input_data = hp_data, 
                     mgcv_model = gam.s, 
                     terms = c("Intercept", "pef"))
tit <-expression(paste(""*beta[`pef`]*"")) 
p1 <- 
  ggplot() + geom_sf(data = lb, col = "lightgrey") +
  geom_point(data = vcs, aes(x = X, y = Y, colour = b_pef), alpha = 1) + 
  scale_colour_continuous_c4a_div("brewer.rd_yl_bu", name = tit) +
  theme_bw() + 
  theme(legend.position = "bottom", 
        legend.key.width = unit(1, "cm"),) + 
  xlab("") + ylab("")
# 2. over grid - recall it needs an intercept term and a days value!
vcs <- calculate_vcs(input_data = l_grid |> mutate(Intercept = 1, days = mean(hp_data$days)), 
                     mgcv_model = gam.s, 
                     terms = c("Intercept", "pef"))
p2 <- 
  ggplot() + 
  geom_sf(data = vcs, aes(fill = b_pef), col = NA) + 
  scale_fill_continuous_c4a_div("brewer.rd_yl_bu", name = tit) +
  theme_bw()+
  theme(legend.position = "bottom", 
        legend.key.width = unit(1, "cm"),) + 
  xlab("") + ylab("")
plot_grid(p1, p2)
Maps of the spatial smooth for pef over the original observations and the hexagonal grid.
Maps of the spatial smooth for pef over the original observations and the hexagonal grid.

Here the relationship of pef, the Potential energy efficiency rating, varies over space with a distinct inflection from negative values in the centre of the study area to positive ones in outer regions.

3.3 Space-Time I

It is also possible to examine the space-time dependencies between the target variable and the pef predictor variable. The code below specifies a GAM with a space-time smooth, with a spatially and temporally varying intercept. As before, a single TP smooth is specified with the dimensions of each basis (a 2D basis for X and Y; a 1D basis for days), as different degrees of smoothness are expected across space and time.

gam.st1 <- gam(priceper~0 + Intercept + s(X,Y,by=Intercept) + s(days, by=Intercept) +  
                pef + t2(X,Y,days, d= c(2,1), by = pef), 
              data = hp_data)
summary(gam.st1)
#> 
#> Family: gaussian 
#> Link function: identity 
#> 
#> Formula:
#> priceper ~ 0 + Intercept + s(X, Y, by = Intercept) + s(days, 
#>     by = Intercept) + pef + t2(X, Y, days, d = c(2, 1), by = pef)
#> 
#> Parametric coefficients:
#>           Estimate Std. Error t value Pr(>|t|)    
#> Intercept 8686.866    411.590  21.106  < 2e-16 ***
#> pef        -10.827      2.262  -4.786 1.83e-06 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Approximate significance of smooth terms:
#>                      edf Ref.df      F  p-value    
#> s(X,Y):Intercept  27.514 28.862 65.183  < 2e-16 ***
#> s(days):Intercept  3.519  4.372  4.383  0.00119 ** 
#> t2(X,Y,days):pef   5.555  5.556  5.862 1.63e-05 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Rank: 164/165
#> R-sq.(adj) =  0.584   Deviance explained =   94%
#> GCV = 3.4246e+06  Scale est. = 3.3556e+06  n = 1888

The pef variable is still not significant globally, but it is over space and time (t2(X,Y,days):pef). When the space-time trends are examined (see the figures below), the increasingly negative effect over time is evident and the spatial distributions indicate a general lower relationship with the target variable in the east fo the study area and increasingly negative one to the west.

# calculate the varying coefficient estimates
vcs <- calculate_vcs(input_data = hp_data, mgcv_model = gam.st1, terms = c("Intercept", "pef"))
# temporal trends
p_time <- 
  vcs |> 
  select(dot, b_Intercept, b_pef) |> 
  pivot_longer(-dot) |>
  mutate(name = recode(name, 
                       "b_Intercept" = '""*beta[Intercept]',
                       "b_pef" = '""*beta[pef]')) |>
  ggplot(aes(x = dot, y = value)) +
  geom_point(alpha = 0.1) + 
  geom_smooth() + 
  facet_wrap(~name,  labeller = label_parsed, scale = "free", ncol = 1) +
  theme_bw() + xlab("Year") + ylab("") 
# spatial trends
tit <-expression(paste(""*beta[`Intercept`]*"")) 
p_sp1 <- 
  ggplot() + geom_sf(data = lb, col = "lightgrey") +
  geom_point(data = vcs, aes(x = X, y = Y, colour = b_Intercept), alpha = 1) + 
  scale_colour_continuous_c4a_seq("brewer.yl_gn_bu", name = tit) +
  theme_bw() + 
  xlab("") + ylab("")
tit <-expression(paste(""*beta[`pef`]*"")) 
p_sp2 <- 
  ggplot() + geom_sf(data = lb, col = "lightgrey") +
  geom_point(data = vcs, aes(x = X, y = Y, colour = b_pef), alpha = 1) + 
  scale_colour_continuous_c4a_div("brewer.rd_yl_bu", name = tit) +
  theme_bw() + 
  xlab("") + ylab("")

plot_grid(p_time, plot_grid(p_sp1, p_sp2, ncol = 1), nrow = 1, rel_widths = c(3.5,6))
Summaries of coefficient estimates from the the space-time smooths for pef.
Summaries of coefficient estimates from the the space-time smooths for pef.

Of course it is also possible to plot time slices of the spatial distribution of the relationship of pef with priceper over grids as was done before. Notice in the code below how the terms for the Intercept and pef are extracted fro each time period in the looped call to calculate_vcs. This east-west trend and its changes over time are confirmed.

# 1. create time intervals (as above)
pred_days = seq(0, 2190, 365)/100
# 2. create coefficient estimates for each time period (n = 7) 
res_out <- matrix(nrow = nrow(l_grid), ncol = 0)
for (i in pred_days){
  res.i <- calculate_vcs(input_data = l_grid |> mutate(days = i),
                         mgcv_model = gam.st1,
                         terms = c("Intercept", "pef"))
  # select just the coefficient estimates of interest
  res.i <- res.i |> st_drop_geometry() |> select(starts_with("b_pef")) 
  res_out <- cbind(res_out, res.i)
}

# 3. name with years and join to the grid
colnames(res_out) <- paste0("PEF", "_", 2018:2024)
l_grid |> cbind(res_out) |>
  # select the variables and pivot longer
  select(starts_with("PEF")) |> 
  pivot_longer(-geometry) |>
  # make the new days object a factor (to enforce plotting order)
  mutate(name = factor(name, levels = colnames(res_out))) |>
  
   # 4. and plot 
  ggplot() + 
  geom_sf(aes(fill = value), col = NA) +
  # adjust default shading
  scale_fill_continuous_c4a_div(name = "Potential Energy\nEfficiency") +
  # facet
  facet_wrap(~name, ncol = 3) +
  # apply and modify plot theme
  theme_bw() + 
  theme(
    legend.position = "inside",
    legend.direction = "horizontal",
    legend.position.inside = c(0.7, 0.15), 
      legend.text=element_text(size=10),
      legend.title=element_text(size=12),
      strip.background = element_rect(fill="white", colour = "white"), 
      strip.text = element_text(size = 8, margin = margin(b=4)),
      legend.key.width = unit(1.5, "cm"),
        axis.title=element_blank(),
        axis.text=element_blank(),
    axis.ticks=element_blank())  
The changes over time of the spatial distrubtuion of the pef coefficient estimate.
The changes over time of the spatial distrubtuion of the pef coefficient estimate.

3.4 Space-Time II

The previous section created a GAM model with a single space-time smooth for the pef predictor variable. It is also possible to include them as separate smooths, as was done to investigate the variability of the target variable. The code snippet below does this again with a spatially and temporally varying intercept.

gam.st2 <- gam(priceper~0 + Intercept + s(X,Y,by=Intercept) + s(days, by=Intercept) 
              + pef + s(X,Y, by = pef) + s(days, by = pef), data = hp_data)
summary(gam.st2)
#> 
#> Family: gaussian 
#> Link function: identity 
#> 
#> Formula:
#> priceper ~ 0 + Intercept + s(X, Y, by = Intercept) + s(days, 
#>     by = Intercept) + pef + s(X, Y, by = pef) + s(days, by = pef)
#> 
#> Parametric coefficients:
#>           Estimate Std. Error t value Pr(>|t|)    
#> Intercept 8204.971    442.323  18.550   <2e-16 ***
#> pef         -5.215     13.861  -0.376    0.707    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Approximate significance of smooth terms:
#>                      edf Ref.df     F p-value    
#> s(X,Y):Intercept  28.000 28.000 5.870 < 2e-16 ***
#> s(days):Intercept  3.579  4.443 3.896 0.00277 ** 
#> s(X,Y):pef        21.654 26.022 2.039 0.00156 ** 
#> s(days):pef        1.333  1.333 1.432 0.32924    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Rank: 77/80
#> R-sq.(adj) =  0.595   Deviance explained = 94.2%
#> GCV = 3.3618e+06  Scale est. = 3.2623e+06  n = 1888

Here the fixed global term is again not significant, and interestingly neither is separate temporal smooth. The plots fo the smooths confirm this. Together this confirms what was found in the step above in the previous subsection with a combined TP smooth and indicates that there is a spatial trend in the relationship of pef with priceper that changes in intensity over time but not in its spatial distribution. This suggests a super-imposition of spatial and temporal trends: the temporal smooths with days is not significant.

plot(gam.st2, pages = 1)
mgcv plots of the seperate space and time smooths for the pef predictor variable.
mgcv plots of the seperate space and time smooths for the pef predictor variable.

3.5 Summary

This section has applied smooths to a single predictor variable. These have explored time, space and space-time, by constructing smooths parameterised with time and location. These allow changes in the relationship between the predictor variable and the target variable to be examined over space and time. Importantly, a space-time GAM was constructed in 2 different: by including space and time in a singe smooth and then using separate space smooths and time smooths. These reflect different assumptions about the nature of the space-time dependencies in the data and result in different inferences (understanding) about the process if the significant components of the model are considered. This may not be important if the objective of constructing the GAM-based varying coefficient models is prediction, but is if the objective is process understanding, as is commonly the case in geographical analyses of space-time data.

There 2 important implications of this:

  1. It suggests the need to to examine model form and to determine how space and time should be best be used to parameterise GAM smooths. This is done in the next section using an automated approach to model selection.
  2. This would provide a route to avoid super-imposing space-time trends by simply constructing a single smooth combining space time. In the example above, the spatial effect was clear but was proportionately the same over time, i.e. where the (partial derivative), \(\delta f / (\delta Space, \delta Time) = 0\).

4. Working with stgam: model selection

The models in construction in Section 3 used a variety of smooths. The Intercept was modelled a parametric term with separate space and time smooths, and the pef predictor variable was included in parametric form, in parametric form with a single space-time smooth and in parametric form with separate space time smooths. This poses the question of which form to use and how to specify the model? Which is best? A key focus of the stgam package is to seek to answer this question. It does this by creating and evaluating multiple models.

4.1 Using GCV to evaluate GAMs with smooths

One way to evaluate models is to compare them using some metric. Commonly AIC (Akaike 1998) corrected AIC (AICc) (Hurvich and Tsai 1989) or BIC (Schwarz 1978) are used to compare models as they provide a parsimonious measures model fit (balancing model complexity and prediction accuracy). However, there are some uncertainties over using these to compare mgcv GAM smooth models due to variations in the effective degrees of freedom (EDF) introduced by the smooths. The result that model EDFs vary depending on how the smoothing parameters are selected, potentially leading to over-penalization or under-penalization in BIC calculations. As a result model Generalized Cross-Validation (GCV) score is recommended for evaluating and comparing mgcv GAMs (S. N. Wood 2017; Marra and Wood 2011). GCV provides an un-biased risk estimator (UBRE) estimate of model fit. It is similar to BIC as it is able to balance model fit with complexity, but does not suffer from some of problems of using BIC when applied to GAMs (e.g. bias in non-parametric model comparison, over-penalises complex smooths etc). The best model is one that minimises the GCV score.

The code below extracts the GCV from each of the models in the last section and prints them out in order. Here it can be seen that the GAM model with pef specified in separate space-time smooths generated the best model.

df <- data.frame(Model = c("Time", "Space", "Space-Time I", "Space-Time II"),
                 GCV = c(gam.t$gcv.ubre, gam.s$gcv.ubre, gam.st1$gcv.ubre, gam.st2$gcv.ubre))
# rank the models 
df |> arrange(GCV)
#>           Model     GCV
#> 1 Space-Time II 3361850
#> 2         Space 3367181
#> 3          Time 3415938
#> 4  Space-Time I 3424640

4.2 Model selection: determining model form

In a space-time model there are 6 options for each predictor variable:

  1. It is omitted.
  2. It is included as a parametric response with no smooth.
  3. It is included in parametric form and in a spatial smooth with location.
  4. It is included in parametric form and in a temporal smooth with time.
  5. It is included in parametric form and in a single space-time smooth.
  6. It is included in parametric form and in 2 separate space and time smooths.

The intercept can be treated similarly, but without it being absent (i.e. 5 options). Obviously there are 3 options for spatial models (i., ii. and iii.) and for temporal models (i., ii. and iv.), with each having 2 options for the intercept. Thus for a purely spatial or purely a temporal regression with \(k\) predictor variables there are \(2 \times 3^k\) potential models and for a space-time regression there are \(5 \times 6^k\) potential models to evaluate.

Recall that the hp_data contains a number of numeric variables that could be of use in explaining the space-time variations in the priceper target variable:

head(hp_data)
#> # A tibble: 6 × 17
#>   price priceper   tfa dot         days   yot  beds type    cef   pef ageb     lad       X     Y     Xo     Yo Intercept
#>   <dbl>    <dbl> <dbl> <date>     <dbl> <dbl> <int> <chr> <int> <int> <chr>    <chr> <dbl> <dbl>  <int>  <int>     <dbl>
#> 1 1000     7463.   134 2023-02-22 18.8   2023     9 T        60    82 1930-19… E090…  532.  189. 531674 188707         1
#> 2  520     7879.    66 2022-05-27 16.0   2022     3 T        57    74 1930-19… E090…  537.  189. 536888 189269         1
#> 3  325     5078.    64 2021-01-15 11.1   2021     4 T        50    82 1950-19… E090…  550.  191. 549763 190717         1
#> 4  300     2400    125 2020-01-17  7.44  2020     5 T        64    85 1900-19… E090…  539.  187. 538646 186513         1
#> 5  585     7222.    81 2020-11-13 10.4   2020     4 T        75    88 1983-19… E090…  515.  173. 514710 172535         1
#> 6  618.    6786.    91 2023-05-18 19.6   2023     4 T        61    84 1950-19… E090…  533.  170. 533233 170230         1

These include cef (Current energy efficiency rating) pef (Potential energy efficiency rating) and beds (Number of bedrooms), as well as location (X and Y) and time (days). Thus with \(k = 3\) variables could are 1080 potential models to evaluate in space-time case study and 54 potential models in a spatial case study. Here 2 of the variables are used to illustrate an evaluation of 180 space-time models, pef as before and beds.

The evaluate_models() function in the stgam package constructs and compares the full set of potential models. It uses the GCV value of each model to evaluate and rank them. Note that in the code below, ncores is set to 2 pass CRAN diagnostic checks - you may want to specify more using detectCores()-1 from the parallel package. Remember that the input data needs to have a defined Intercept term which can be created in the following way input_data |> mutate(Intercept = 1)) (this was done at the start of Section 3 for hp_data).

library(doParallel)
t1 <- Sys.time()
stvc_mods <- evaluate_models(
  input_data = hp_data,
  target_var = "priceper",
  vars = c("pef", "beds"),
  coords_x = "X",
  coords_y = "Y",
  VC_type = "STVC",
  time_var = "days",
  ncores = 2)
Sys.time() - t1  # about 14 minutes (6 minutes with 15 cores!)

The best 10 models can be extracted (i.e. those with the lowest GCV score) using the gam_model_rank() function in the stgam package introduced in Section 3. Interestingly the top 5 ranked models all have space and time specified for each predictor variable, but in different combinations of single and separate space-time smooths. The top 4 models all have the beds predictor variable specified in separate space time smooths, suggesting that while there are space and time dependences with the target variable, there are not space-time ones.

mod_comp <- gam_model_rank(stvc_mods)
# have a look
mod_comp |> select(-f) 
#>    Rank Intercept       pef      beds      GCV
#> 1     1     t2_ST s_T + s_S s_T + s_S 16813.82
#> 2     2 s_T + s_S s_T + s_S s_T + s_S 16815.01
#> 3     3     t2_ST     t2_ST s_T + s_S 16815.67
#> 4     4 s_T + s_S     t2_ST s_T + s_S 16822.48
#> 5     5 s_T + s_S s_T + s_S     t2_ST 16823.02
#> 6     6       s_S s_T + s_S s_T + s_S 16824.47
#> 7     7     t2_ST       s_S s_T + s_S 16826.02
#> 8     8     t2_ST s_T + s_S     t2_ST 16826.23
#> 9     9     t2_ST s_T + s_S       s_S 16827.10
#> 10   10 s_T + s_S       s_S s_T + s_S 16827.90

4.3 The final model

A final model for use in analysis can be specified in the form of the top ranked model above. This included the Intercept in a single space-time TP smooth and in separate space and time smooths for pef and beds.

First the formula is extracted and can be inspected:

f <- as.formula(mod_comp$f[1])
f
#> priceper ~ Intercept - 1 + t2(X, Y, days, d = c(2, 1), by = Intercept) + 
#>     s(X, Y, by = pef) + s(days, by = pef) + s(X, Y, by = beds) + 
#>     s(days, by = beds)

Then this is put into a mgcv GAM model, and checked for over-fitting using the k.check function in themgcv package. It underpins the gam.check function but does not display the diagnostic plots. Here the k-index values are near to 1, the k' and edf parameters are not close, and importantly the edf values are all much less than the k' values, so this model is reasonably well tuned. NB if the k' and edf parameters are close for some smooths, then you may want to increase the k parameter in the relevant smooth. These are automatically determined by the mgcv package but can be specified manually - see the mgcv help for k.check and choose.k.

gam.m <- gam(f, data = hp_data, method = "REML")
k.check(gam.m)
#>                         k'       edf   k-index p-value
#> t2(X,Y,days):Intercept 124 30.729788 0.8165558  0.0000
#> s(X,Y):pef              30  3.055193 0.6432838  0.0000
#> s(days):pef             10  2.183809 1.0422023  0.9675
#> s(X,Y):beds             30 24.222744 0.6432838  0.0000
#> s(days):beds            10  2.936562 1.0422023  0.9550

A summary of the model can be examined and this shows that nearly all of the terms are significant except the temporal smooth for beds.

summary(gam.m)
#> 
#> Family: gaussian 
#> Link function: identity 
#> 
#> Formula:
#> priceper ~ Intercept - 1 + t2(X, Y, days, d = c(2, 1), by = Intercept) + 
#>     s(X, Y, by = pef) + s(days, by = pef) + s(X, Y, by = beds) + 
#>     s(days, by = beds)
#> 
#> Parametric coefficients:
#>           Estimate Std. Error t value Pr(>|t|)    
#> Intercept  10040.1      454.6   22.09   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Approximate significance of smooth terms:
#>                           edf Ref.df     F  p-value    
#> t2(X,Y,days):Intercept 30.730 37.032 2.236 3.43e-05 ***
#> s(X,Y):pef              3.055  3.082 3.189  0.01350 *  
#> s(days):pef             2.184  2.673 5.151  0.00223 ** 
#> s(X,Y):beds            24.223 26.353 5.647  < 2e-16 ***
#> s(days):beds            2.937  3.349 1.062  0.31011    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Rank: 203/205
#> R-sq.(adj) =  0.601   Deviance explained = 61.5%
#> -REML =  16814  Scale est. = 3.2129e+06  n = 1888

4.4 Varying coefficient estimates

The spatially and temporally varying coefficients estimates generated by the model can be extracted in a number of different ways using the calulate_vcs function in the stgam package, as indicated in earlier sections. There are basically 2 options for generating the varying coefficient estimates: i) over the original data points or ii) over spatial surfaces for specific time slices.

For the first option the original data, the GAM model and a vector of the model terms (i.e. predictor variables names) are simply passed to the calculate_vcs function. The second option requires slices of the space-time data to be passed to the calulate_vcs function and a deeper consideration of how the results are processed and combined. In both cases the results can be summarised, mapped.

Considering first the data of the original observations:

vcs <- calculate_vcs(input_data = hp_data, 
                    mgcv_model = gam.m, 
                    terms = c("Intercept", "pef", "beds"))

A summary over space and time of the coefficient estimates shows that the intercept the is large and positive and that pef and beds are generally negative but with some positive values at in some places or at some times.

vcs |> select(starts_with("b_")) |>
  apply(2, summary) |> round(1)
#>         b_Intercept b_pef b_beds
#> Min.         4236.0 -60.0 -552.4
#> 1st Qu.      8303.6 -34.5 -369.7
#> Median      10187.8 -26.3 -260.3
#> Mean        10040.1 -26.1 -240.8
#> 3rd Qu.     11703.2 -17.6 -163.4
#> Max.        15778.7   7.7  388.0

These can be examined over time:

vcs |> select(dot, starts_with("b_")) |> 
  rename(`Intercept` = b_Intercept,
         `Potential Energy Efficiency` = b_pef,
         `Bedrooms` = b_beds) |>
  pivot_longer(-dot) |>
  mutate(name = factor(name, levels=c("Intercept","Potential Energy Efficiency", "Bedrooms"))) |>
  group_by(dot, name) |>
  summarise(
    lower = quantile(value, 0.25),
    median = median(value),
    upper = quantile(value, 0.75)
  ) |>
  ggplot(aes(x = dot, y = median)) +
  geom_point(col = "blue", alpha = 0.2) +
  geom_smooth() +
  facet_wrap(~name, scale = "free_y") +
  theme_bw() + xlab("") + ylab("") +
  theme(strip.background = element_rect(fill="white"))
#> `summarise()` has grouped output by 'dot'. You can override using the `.groups` argument.
#> `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
The changes over time of the coefficient estimates of the final model.
The changes over time of the coefficient estimates of the final model.

Standard dplyr and ggplot approaches can be used to join and map the coefficient estimates as in the figure below, which summarises the coefficient estimates for pef (b_pef) by year.

# make spatial data  
vcs_sf <- vcs |> st_as_sf(coords = c("X", "Y"), remove = F) 
# plot 
ggplot()  + 
  geom_sf(data = lb) +
  geom_sf(data = vcs_sf, aes(col = b_pef)) + 
  scale_colour_continuous_c4a_div(palette="brewer.rd_bu", 
                                name = "Potential\nEnergy Efficiency") + 
  facet_wrap(~yot) + 
  theme_bw() + 
  theme(
    legend.position = "inside",
    legend.direction = "horizontal",
    legend.position.inside = c(0.7, 0.15), 
      legend.text=element_text(size=10),
      legend.title=element_text(size=12),
      strip.background = element_rect(fill="white", colour = "white"), 
      strip.text = element_text(size = 8, margin = margin(b=4)),
      legend.key.width = unit(1.5, "cm"),
        axis.title=element_blank(),
        axis.text=element_blank(),
    axis.ticks=element_blank())  
The varying pef (Potential Energy Efficiency) coefficient estimates over space and time.
The varying pef (Potential Energy Efficiency) coefficient estimates over space and time.

Finally the l_grid can be used summarise the coefficient estimates over space and time as well, here summarising each year, approximately described in the pred_days object defined above. Notice how the for loop below combines all the coefficient estimates this time.

The maps reflect the relatively small changes in the relationship with priceper over time plotted above.

# create time slices
years  <- 2018:2024
# calculate  over the grid for each time slice
res_out <- matrix(nrow = nrow(l_grid), ncol = 0)
for (i in 1:length(years)){
  # convert years to days
  day.val = (years[i]-2018) * 365 / 100
  res.i <- calculate_vcs(input_data = l_grid |> mutate(days = day.val),
                         mgcv_model = gam.m,
                         terms = c("Intercept", "pef", "beds"))
  # select all the coefficient estimates 
  res.i <- 
    res.i |> 
    st_drop_geometry() |> 
    select(starts_with("b_"), 
           starts_with("se_")) 
  # rename them
  names(res.i) <- paste0(names(res.i), "_", years[i])
  # bind to the result
  res_out <- cbind(res_out, res.i)
  cat(years[i], "\t")
}
#> 2018     2019    2020    2021    2022    2023    2024    
# join to the grid
l_grid |> cbind(res_out) |>
  # select the variables and pivot longer
  select(starts_with("b_beds")) |> 
  pivot_longer(-geometry) |>
  # make the new days object a factor (to enforce plotting order)
  mutate(name = factor(name, levels = paste0("b_beds_", 2018:2024))) |>

   # 4. and plot 
  ggplot() + 
  geom_sf(aes(fill = value), col = NA) +
  # adjust default shading
  scale_fill_continuous_c4a_div("brewer.rd_yl_bu", name = "Bedrooms") +
  # facet
  facet_wrap(~name, ncol = 3) +
  # apply and modify plot theme
  theme_bw() + 
  theme(
    legend.position = "inside",
    legend.direction = "horizontal",
    legend.position.inside = c(0.7, 0.15), 
      legend.text=element_text(size=10),
      legend.title=element_text(size=12),
      strip.background = element_rect(fill="white", colour = "white"), 
      strip.text = element_text(size = 8, margin = margin(b=4)),
      legend.key.width = unit(1.5, "cm"),
        axis.title=element_blank(),
        axis.text=element_blank(),
    axis.ticks=element_blank())  
The varying beds (Bedrooms) coefficient estimates over time and the grid surface.
The varying beds (Bedrooms) coefficient estimates over time and the grid surface.

4.5 Summary

This section has illustrated the the use of functions in the the stgam package. It suggests the following workflow:

  1. Prepare the data: lengthen the data.frame, tibble or sf object containing the data to have single location and time variables for each observation (in the above examples these were X,Yand days), and an Intercept as an addressable term.
  2. Evaluate all possible models. For spatial or temporal problems each predictor is specified in 3 ways, for space-time problems each predictor is specified in 6 ways.
  3. Rank the models by their GCV score, identfy any consistent model specification trends in the top ranked models and select the best model with the lowest GCV score.
  4. Extract the formula and create the final model.
  5. Calculate the varying coefficient estimates: to quantify how the relationships between the target and predictor variables vary over space, time or space-time.
  6. Create maps, time series plots etc

This evaluates and ranks multiple models using model GCV value. This was done algorithmically using the evaluate_models function. However, this was not undertaken in isolation. Rather it built on the investigations in Section 3 to determine whether space and time effects were present in the data. This set of model investigations were gone through to both develop and confirm knowledge of processes related to house prices in London: that is the analysis was both considered and contextual. For example, in Section 3 it was determined that a varying intercept term was appropriate and with a different dataset there may be a need to explore different avenues.

Similarly more of the model space could have been examined, for example to include cef in the models. The stgam package allows these choices to be validated through an automated approach, providing an exploration of the full set of potential choices. The the GCV as an unbiased risk estimator was useful helping to evaluate models

5. Notes, resources and summaries

5.1 Use REML in the GAMs

All of the space time GAM models in this vignette have been specified with method = "REML". Model estimation in GAMs can be conducted in two ways: (a) predictive (i.e., out-of-sample minimization) via Generalized Cross-Validation (GCV) and (b) Bayesian (i.e. attach priors on basis coefficients) via restricted maximum likelihood (REML). REML tends to provide more stable estimates of the smoothing parameters (i.e., reduce over-fitting) and tends to provide more robust estimates of the variance. GCV is more computationally efficient but can over-fit, especially when errors are correlated. GCV is set as the default in mgcv.

5.2 Number of knots in space and time

In Section 4.3, the GAM model was checked for under- and over-fitting using the k.check function in themgcv package the advice was to increase k in the smooths if the k' and edf parameters were close. This can be done by specifying it manually in the smooth as in the hypothetical example below.

gam_m <- gam(y~s(X,Y, by = x, k = 40), data = input_data)

The number of knots in the smooth (\(k\)) determines the complexity of the response-to-predictor variable smooths in the GAM. However, determining k is a balance between setting it too large which can result in over-fitting and under-fitting if it is too low. Computational issues (speed and model stability) can arise if k is set to be too large, especially in the space-time case for small datasets.

5.3 Resources

We think these talks by Noam Ross provide really useful tips and pointers for working with GAMs and mgcv:

5.4 Summary

The rationale for using GAMs with TP smooths and GP bases for spatially varying coefficient (SVC) or temporally varying coefficient (TVC) models is as follows:

The workflow suggested in this vignette and in the stgam package is to determine the most appropriate model form (both steps evaluated by minimising a Generalized Cross-Validation (GCV), an un-biased risk estimator the model recommended for evaluating mgcv GAMs (S. N. Wood 2017)). This avoids making potentially unreasonable assumptions about how space and time interact in spatially and temporally varying coefficient (STVC) models. The best model can then be extracted an examined for the nature of the space-time interactions it suggests, before generating the varying coefficient estimates and mapping or plotting the results.

Future developments will seek to examine the scales of spatial dependencies in the data sing kriging based approaches and will extend the stgam package to work with large data (i.e. to draw from the functionality of the gamm function in mgcv).

References

Akaike, Hirotogu. 1998. “Information Theory and an Extension of the Maximum Likelihood Principle.” In Selected Papers of Hirotugu Akaike, 199–213. Springer.
Bin, Chi, Adam Dennett, Thomas Oléron-Evans, and Robin Morphet. 2024. “House Price Per Square Metre in England and Wales, 1995-2024.” https://reshare.ukdataservice.ac.uk/855033/.
Comber, Lex, Paul Harris, and Chris Brunsdon. 2024. stgam: Spatially and Temporally Varying Coefficient Models Using Generalized Additive Models. https://CRAN.R-project.org/package=stgam.
Hurvich, Clifford M, and Chih-Ling Tsai. 1989. “Regression and Time Series Model Selection in Small Samples.” Biometrika 76 (2): 297–307.
Marra, Giampiero, and Simon N Wood. 2011. “Practical Variable Selection for Generalized Additive Models.” Computational Statistics & Data Analysis 55 (7): 2372–87.
Schwarz, Gideon. 1978. “Estimating the Dimension of a Model.” The Annals of Statistics, 461–64.
Wood, Simon. 2021. mgcv: Mixed GAM Computation Vehicle with Automatic Smoothness Estimation. https://CRAN.R-project.org/package=mgcv.
Wood, Simon N. 2017. Generalized Additive Models: An Introduction with r. CRC press.