stgam
The stgam package (Comber, Harris, and Brunsdon 2024) provides a wrapper around mgcv functionality (S. Wood 2021) to support space-time analysis, with a focus on prediction but also, importantly, on inference (or process understanding). The latter is commonly the objective of geographical analyses, which are often concerned with how processes and statistical relationships vary over space and / or time.
This vignette illustrates a workflow for analyses that quantify and model variations over space and time based on GAM smooths. The case study examines house prices (actually price per unit area) for terraced houses in London from 2018 to 2024, and highlights the importance of investigating the data for spatial and / or temporal variability before constructing space-time GAMs and undertaking the stgam workflow.
You should load the stgam package and the data. The data are a sample of terraced house sales for 2018 to 2024, extracted from Bin et al. (2024), and located via the ONS lookup tables that allow UK postcodes to be linked to location (actually the geometric centroids of the postcode areas).
library(cols4all)
library(dplyr)
library(ggplot2)
library(tidyr)
library(sf)
library(cowplot)
library(mgcv)   # for gam(); mgcv may already be attached via stgam
# load the package and data
library(stgam)
data("hp_data")
data("lb")
You should examine the help for the datasets, and especially the variables that hp_data contains:
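For example:
# view the dataset help pages
?hp_data
?lb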
The first stage is simply to examine the data, particularly the continuous variables over space and time. The code below examines the data and undertakes some initial investigations.
# examine what is loaded
hp_data
#> # A tibble: 1,888 × 13
#> price priceper tfa dot yot beds type cef pef ageb lad X Y
#> <dbl> <dbl> <dbl> <date> <dbl> <int> <chr> <int> <int> <chr> <chr> <int> <int>
#> 1 1000 7463. 134 2023-02-22 2023 9 T 60 82 1930-1949 E09000014 531674 188707
#> 2 520 7879. 66 2022-05-27 2022 3 T 57 74 1930-1949 E09000031 536888 189269
#> 3 325 5078. 64 2021-01-15 2021 4 T 50 82 1950-1966 E09000016 549763 190717
#> 4 300 2400 125 2020-01-17 2020 5 T 64 85 1900-1929 E09000031 538646 186513
#> 5 585 7222. 81 2020-11-13 2020 4 T 75 88 1983-1990 E09000027 514710 172535
#> 6 618. 6786. 91 2023-05-18 2023 4 T 61 84 1950-1966 E09000008 533233 170230
#> 7 1183. 12117. 97.6 2022-04-22 2022 5 T 63 66 1900-1929 E09000027 516471 174772
#> 8 800 6154. 130 2023-09-28 2023 7 T 60 80 1900-1929 E09000010 532032 194831
#> 9 560 6022. 93 2022-12-21 2022 5 T 47 79 1930-1949 E09000029 526674 164298
#> 10 300 3750 80 2019-09-16 2019 4 T 71 86 1950-1966 E09000004 548369 179349
#> # ℹ 1,878 more rows
The print out of the first 10 records in hp_data indicates that we have a time variable (dot) in date format as well as location in metres (X and Y). The analyses will model price per unit area (the priceper variable in hp_data). This is in units of pounds per square metre, and GAM smooth models will be constructed that regress this against other variables in the data, including location and time.
However, it is often more useful to have a continuous scalar value for the time variable, in this case the date of sale transfer (dot), and to have location in kilometres rather than metres. The code below uses the earliest date in the dataset to create a variable called days to represent time (here 100s of days since the earliest date) and rescales the locational data, while retaining the original coordinates for mapping:
hp_data <-
hp_data |>
# create continuous time variable
mutate(days = as.numeric(dot - min(dot))/100) |>
relocate(days, .after = dot) |>
# scale location and retain original coordinates
mutate(Xo = X, Yo = Y) |>
mutate(X = X/1000, Y = Y/1000)
Again this can be examined:
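For example, a simple check:
# inspect the new days variable and rescaled coordinates
head(hp_data)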
The data and the target variable can be mapped as in the code snippet below. Note the use of the Xo and Yo coordinates in the mapping with ggplot.
# map the data layers
lb |>
ggplot() + geom_sf() +
geom_point(data = hp_data |> st_drop_geometry(), aes(x = Xo, y = Yo, col = priceper)) +
scale_color_viridis_c(option = "magma") +
theme_bw() + xlab("") + ylab("")
The priceper variable (house price per square metre in £s).

The map above shows the spatial distribution of priceper and also indicates that there were no terraced house sales in the centre of London (in the City of London local authority district) in this period. In many of the analyses below the lb spatial dataset of London boroughs will be used to provide spatial context to the results. To do this the coordinates of the spatial data need to be rescaled in the same way the X and Y variables in hp_data were rescaled to aid model construction. The code below creates a rescaled version of lb projected in kilometres (remember the original data can always be reloaded using data(lb)).
# transform to km
lb <- st_transform(lb, pipeline = "+proj=pipeline +step +proj=unitconvert +xy_out=km")
# remove the projection to avoid confusing ggplot
st_crs(lb) <- NA
The correlations and distributions of the numeric variables can be
examined. This is important in order to establish that any regression of
the proposed target variable against the predictor variables may yield
something interesting and sensible, and just as importantly to identify
any variables that might be a problem when trying to make a model. For
example, in this case it is good that the proposed target variable
(priceper
) has a healthy tail.
The code below generates boxplots and then histograms.
# boxplots
hp_data |>
select(lad, price, priceper, tfa, days, beds, cef, pef, X, Y) |>
pivot_longer(-lad) |>
ggplot(aes(x = value)) +
geom_boxplot(fill="dodgerblue") +
facet_wrap(~name, scales = "free") +
theme_bw()
# histograms
hp_data |>
select(lad, price, priceper, tfa, days, beds, cef, pef, X, Y) |>
pivot_longer(-lad) |>
ggplot(aes(x = value)) +
geom_histogram(aes(y=after_stat(density)),bins = 30,
fill="tomato", col="white") +
geom_density(alpha=.5, fill="#FF6666") +
facet_wrap(~name, scales = "free") +
theme_bw()
Correlations are examined for similar reasons: to check for issues that may be a problem later in the analysis workflow. Here the aim is to establish that there are reasonable correlations with the target variable, and a lack of collinearity amongst the predictor variables, when they are examined without considering time or space (i.e. they are examined globally).
# correlations
hp_data |>
select(priceper, price, tfa, beds, cef, pef) |>
cor() |> round(3)
#> priceper price tfa beds cef pef
#> priceper 1.000 0.723 0.308 0.179 -0.118 -0.220
#> price 0.723 1.000 0.791 0.549 -0.082 -0.216
#> tfa 0.308 0.791 1.000 0.780 -0.034 -0.202
#> beds 0.179 0.549 0.780 1.000 -0.079 -0.170
#> cef -0.118 -0.082 -0.034 -0.079 1.000 0.388
#> pef -0.220 -0.216 -0.202 -0.170 0.388 1.000
Finally, thinking about space-time analysis of your dataset, as a general heuristic, any dataset for use in space-time analysis should have a minimum of about 100 locations and a minimum of 50 observation time periods. You should consider this when you are assembling your data.
The investigations in the previous section were all concerned with establishing the viability of undertaking the proposed analysis, using standard exploratory data analysis approaches to understand distributions and correlations. In this section these are extended to test for the presence of variability in the data, but now over time, space and space-time. This is done by constructing a series of regression models of different forms and with different parameters. Tools for constructing Generalized Additive Models (GAMs) using the mgcv package (S. Wood 2021) are introduced, with a particular focus on using GAM smooths or splines to explore variations over space and time.
As an initial baseline model, the code below creates a standard OLS regression model of priceper using the lm function. The model fit is weak, but the model summary indicates that some of the variables (pef and beds) are significant predictors of the target variable. The analysis of variance (ANOVA), calculated using the anova function, shows how much variance in priceper is explained by each predictor and whether that contribution is statistically significant. Here higher values of the F value test statistic provide stronger evidence that the predictor is useful.
So the ANOVA provides evidence that all three predictors (cef, pef, beds) contribute: the F-values and associated p-values (Pr(>F)) show strong evidence that each variable contributes meaningfully to explaining variation in priceper. Note, however, that the overall explanatory power remains low (an adjusted \(R^2\) of about 0.07), even in the absence of major issues such as multicollinearity.
# an OLS model
m_ols <- lm(priceper ~ cef + pef + beds, data = hp_data)
summary(m_ols)
#>
#> Call:
#> lm(formula = priceper ~ cef + pef + beds, data = hp_data)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -7057.6 -1752.2 -554.9 1220.4 23271.9
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 10751.489 690.879 15.562 < 2e-16 ***
#> cef -9.633 6.325 -1.523 0.128
#> pef -59.626 8.047 -7.410 1.90e-13 ***
#> beds 298.329 46.327 6.440 1.52e-10 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 2740 on 1884 degrees of freedom
#> Multiple R-squared: 0.06997, Adjusted R-squared: 0.06849
#> F-statistic: 47.25 on 3 and 1884 DF, p-value: < 2.2e-16
anova(m_ols)
#> Analysis of Variance Table
#>
#> Response: priceper
#> Df Sum Sq Mean Sq F value Pr(>F)
#> cef 1 2.1265e+08 212648515 28.323 1.149e-07 ***
#> pef 1 5.4018e+08 540183718 71.948 < 2.2e-16 ***
#> beds 1 3.1134e+08 311344258 41.468 1.516e-10 ***
#> Residuals 1884 1.4145e+10 7508013
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
A final and quick trick to confirm the presence of space and time effects is to interact space or time, as a dummy variable, with one of the other variables. Here this is done with tfa (total floor area), which unsurprisingly is highly correlated with the target variable (priceper):
# Dummy with LADs
m_dummy1 <- lm(priceper ~ tfa:lad, data = hp_data)
summary(m_dummy1)
anova(m_dummy1)
This shows that there are important and significant differences between the first London borough (which has a lad value equal to "E09000001"), which is not shown in the model summary, and many of the rest that are, suggesting some important spatial differences in the interaction of floor area and price in different boroughs. This is confirmed by the F values in the ANOVA.
# Dummy with Time
m_dummy2 <- lm(priceper ~ tfa:days, data = hp_data)
summary(m_dummy2)
anova(m_dummy2)
Similarly, an even larger effect with time is demonstrated when days is interacted with tfa.
Thus, despite the models being globally weak with low \(R^2\) values, there is evidence of spatial and temporal interactions with the target variable that warrant further more formal exploration with respect to potential temporal and spatial trends. GAMs with smooths offer a route to investigate these.
GAMs with splines (or smooths - the terms are used interchangeably) can be used to explore and quantify spatial and temporal variations. However before considering spatial and temporal aspects, it is important to outline what splines do.
Splines help to fit a (smooth) curve through a cloud of messy data points — like drawing a line that follows a winding mountain road. They generate flexible (bendy) fits (curves) that are able to adapt to the shape of your data, and thus describe the relationship between variables better than a straight line. For example, consider the scatterplot of some simulated data below: it would be difficult to fit a straight line through it.
# create simulated data
set.seed(12)
x <- runif(500)
mu <- sin(2 * (4 * x - 2)) + 2 * exp(-(16 ^ 2) * ((x - .5) ^ 2))
y <- rnorm(500, mu, .3)
# plot x and y
ggplot() +
geom_point(aes(x,y)) +
theme_bw()
The simulated x and y data.

The code below uses a GAM spline to determine the relationship between x and y without having to pre-specify any particular form (e.g. quadratic, exponential, etc.). The bendiness or wiggliness of the fit is determined automatically, in different ways depending on the type of smooth: via penalisation and basis functions for thin plate regression splines, the default in smooths created using s in the mgcv package. Effectively, this splits the data into chunks and fits piecewise polynomial relationships, rather than fitting a single straight line.
# a GAM illustration with a spline using s
gam_s_example <- gam(y~s(x))
# extract the smooth fit
y.s <- gam_s_example$fitted.values
# plot
ggplot() +
geom_point(aes(x,y), col = "grey") +
geom_line(aes(x, y = y.s), lwd = 1) +
theme_bw()
The x and y data with a GAM spline fitted.

The purpose of this diversion into splines and smooths is to illustrate how they model different relationships (slopes) with y at different locations along the x feature space. In this way spline smooth curves can be used to capture non-linear relationships in attribute space (here the relationship of x with y). Importantly, in the context of space-time varying coefficient modelling with stgam, a spline can be used to model how the relationship between x and y varies with respect to time or location if the smooth is also parameterised with those variables.
It would be useful to examine how the target variable, priceper, varies over time. One way of doing this is to extend the use of GAM smooths illustrated above. The code snippet below seeks to model the variation in priceper, but this time as a function of time (the days variable).
# the first GAM
gam.1 <- gam(priceper~s(days), data = hp_data)
summary(gam.1)
#>
#> Family: gaussian
#> Link function: identity
#>
#> Formula:
#> priceper ~ s(days)
#>
#> Parametric coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 6832.48 64.11 106.6 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Approximate significance of smooth terms:
#> edf Ref.df F p-value
#> s(days) 3.798 4.708 15.76 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> R-sq.(adj) = 0.0374 Deviance explained = 3.93%
#> GCV = 7.7786e+06 Scale est. = 7.7588e+06 n = 1888
The smooth can be investigated by plotting it with time. Note the use
of the actual data variable dot
rather than
days
in the code below - this is simply to have a friendly
x-axis on the plot! - and the use of the predict
function
to extract the standard errors:
# create a data frame with x, predicted y, standard error
x <- hp_data$dot
y <- gam.1$fitted.values
se <- predict(gam.1, se = TRUE, hp_data)$se.fit
u <- y+se
l <- y-se
df <- data.frame(x, y, u, l)
# plot!
ggplot(df, aes(x, y, ymin = l, ymax = u)) +
geom_ribbon(fill = "lightblue") +
geom_line() +
theme_bw() +
xlab("Date") + ylab("priceper")
This shows variation in the modelled relationship of priceper with time. The code above extracted the predicted values (\(\hat{y}\)) of priceper from the model and plotted them against time, using the actual date rather than the continuous variable. The plot shows how the target variable varies with time, potentially reflecting effects of the pandemic, with increases in the value of homes with their own gardens and outdoor spaces (even small ones!) rather than apartments or flats. This provides confirmation of the potential suitability of a temporal analysis of priceper.
The spatially varying trends can be examined in a similar way, again using a standard s spline, this time with location.
# the second GAM
gam.2 <- gam(priceper~s(X,Y), data = hp_data)
summary(gam.2)
#>
#> Family: gaussian
#> Link function: identity
#>
#> Formula:
#> priceper ~ s(X, Y)
#>
#> Parametric coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 6832.48 43.62 156.6 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Approximate significance of smooth terms:
#> edf Ref.df F p-value
#> s(X,Y) 27.47 28.85 81.52 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> R-sq.(adj) = 0.554 Deviance explained = 56.1%
#> GCV = 3.6469e+06 Scale est. = 3.5919e+06 n = 1888
plot(gam.2, asp = 1)
The mgcv plot of the spatial smooth.

Notice that the model fit (\(R^2\)) has increased substantially from the previous models. This suggests that this variable does vary over space. Again the smooth can be investigated by plotting, but this time it is a 2-dimensional surface rather than a line. However, it would be useful to visualise this as a surface rather than just over the hp_data point locations. The code below uses the predict function to model the relationship over a hexagonal grid covering the extent of the London boroughs, with values of X and Y taken from the centroid of each hexagonal cell, and plots the surface over geographic space:
# 1. create a grid object covering the study area from the LB data
l_grid <-
st_make_grid(lb, square=FALSE,n=50) |>
st_sf() |>
st_join(lb) |>
filter(!is.na(name))
# rename the geometry, sort row names
st_geometry(l_grid) = "geometry"
rownames(l_grid) <- 1:nrow(l_grid)
# create and add coordinates X and Y
coords <- l_grid |> st_centroid() |> st_coordinates()
#> Warning: st_centroid assumes attributes are constant over geometries
l_grid <- l_grid |> bind_cols(coords)
You may wish to inspect this object before continuing with the mapping procedure.
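A quick check, for example:
# inspect the grid object
head(l_grid)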
# 2. predict over the grid
yhat <- predict(gam.2, newdata = l_grid)
l_grid |> mutate(yhat = yhat) |>
# 3. and plot
ggplot() +
geom_sf(aes(fill = yhat), col = NA) +
# adjust default shading
scale_fill_continuous_c4a_seq("brewer.yl_or_rd", name = "priceper") +
# add context
geom_sf(data = lb, fill = NA) +
# apply and modify plot theme
theme_bw() +
theme(legend.position = "bottom",
legend.key.width = unit(1, "cm"))
This models the spatial trends in the variations over space of the target variable. The results indicate similarly high values in the centre of the study area as in the initial map of the data earlier, and again provide confirmatory evidence that some spatially explicit modelling of priceper is appropriate in this case.
The spline can be extended to model the target variable in space-time by including the location and time variables in the smooth. However, a slightly different set of considerations is required because we are mixing space and time in the mgcv smooths. Consider the code snippet below for the extension into space-time smooths: it specifies a smooth using the t2 function rather than s and contains other parameters in the d argument.
# the third GAM
gam.3 <- gam(priceper~t2(X,Y,days, d = c(2,1)), data = hp_data)
summary(gam.3)
#>
#> Family: gaussian
#> Link function: identity
#>
#> Formula:
#> priceper ~ t2(X, Y, days, d = c(2, 1))
#>
#> Parametric coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 6832.48 42.57 160.5 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Approximate significance of smooth terms:
#> edf Ref.df F p-value
#> t2(X,Y,days) 47.87 52.07 37.39 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> R-sq.(adj) = 0.576 Deviance explained = 58.6%
#> GCV = 3.5118e+06 Scale est. = 3.4209e+06 n = 1888
It is important to consider the difference between s(X,Y,days) (which would be a continuation of the format above, but has not been run) and t2(X,Y,days,d=c(2,1)) before examining the resultant GAM (gam.3). While the spline functions s() and t2() can both model smooth interactions between multiple variables, they handle dimension scaling, marginal smoothness, and basis construction in different ways.
s() is an isotropic smooth and is used when all variables are on the same scale and have similar smoothness behaviours (like X and Y). By default it uses a thin plate regression spline (bs = "tp") and a single basis to model the attribute space. The formulation s(X,Y,days) would therefore use a single basis to model a smooth in 3D space, under the assumption that all the variables (X, Y and days) contribute in the same way and at the same scale to the smooth. This would be fine if the variables were numeric, continuous, and in comparable units such as percentages. But not space and time.
t2() is a tensor product (TP) smooth and is used when variables are on different scales or represent different kinds of effects (like X and Y, with days). It builds a smooth using separate basis functions for each variable and then combines them. The formulation t2(X,Y,days,d=c(2,1)) includes the d parameter that specifies the dimensions of each basis: a 2D basis is specified for X and Y, and a 1D basis for days.
The tensor product smooth is constructed by building (marginal) smooths for each variable and taking the tensor product of those to form the joint smooth. It is useful when the variables have different scales or units (e.g., X and Y are coordinates, days is time) and different degrees of smoothness are expected along each variable.
Thus in this case the TP smooth is specified because space and time
are combined. The code below examines the results and uses the
mgcv
plot function to show the spatial distributions over 9
discretised time chunks (~310 days).
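A minimal call for this (assuming the default plot.gam behaviour of plotting 2D slices for smooths of three variables) would be:
# plot the space-time smooth as a series of 2D spatial slices over time
plot(gam.3, asp = 1)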
However, it is also possible to slice and dice the time variable in other ways. The code snippet below illustrates how this can be done for 365-day periods using the calculate_vcs function in the stgam package. If predictor variables are included in the terms parameter then the function returns coefficient estimates ("b_") and standard errors ("se_") for each of the terms specified. However, in this case we are just interested in the predicted value of the target variable, yhat. The code below uses the grid object created above (l_grid) as a spatial framework to hold the prediction of the priceper variable over these discrete time periods.
# 1. create time intervals (see the creation of days variable above)
pred_days = seq(0, 2190, 365)/100
# 2. create coefficient estimates for each time period (n = 7)
res_out <- NULL
for (i in pred_days){
res.i <- calculate_vcs(input_data = l_grid |> mutate(days = i),
mgcv_model = gam.3,
terms = NULL)
res_out <- cbind(res_out, res.i$yhat)
}
# 3. name with years and join to the grid
colnames(res_out) <- paste0("Y_", 2018:2024)
l_grid |> cbind(res_out) |>
# select the variables and pivot longer
select(-name, -lad, -X, -Y) |>
pivot_longer(-geometry) |>
# make the new days object a factor (to enforce plotting order)
mutate(name = factor(name, levels = paste0("Y_", 2018:2024))) |>
# 4. and plot
ggplot() +
geom_sf(aes(fill = value), col = NA) +
# adjust default shading
scale_fill_continuous_c4a_seq("brewer.yl_or_rd", name = "Predicted \n'priceper'") +
# facet
facet_wrap(~name, ncol = 3) +
# apply and modify plot theme
theme_bw() +
theme(
legend.position = "inside",
legend.direction = "horizontal",
legend.position.inside = c(0.7, 0.15),
legend.text=element_text(size=10),
legend.title=element_text(size=12),
strip.background = element_rect(fill="white", colour = "white"),
strip.text = element_text(size = 8, margin = margin(b=4)),
legend.key.width = unit(1.5, "cm"),
axis.title=element_blank(),
axis.text=element_blank(),
axis.ticks=element_blank())
What this time series of maps shows is that, without any explanatory variables, the trends in priceper vary spatially and change in intensity over time (the increase and decrease in values over space), with limited variation in their spatial pattern.
The formula used to construct the gam.3 model in the previous subsection constructed a smooth that implicitly combined space and time, under an assumption that the variations in space and time of priceper interact with each other. However this may not be the case. More generally, spatial and temporal dependencies might not be expected to interact in data for a relatively small region, whose observations might be expected to be subject to the same socio-economic or environmental pressures; instead, any space and time effects might be expected to be independent of each other. Whereas in a national study any space and time effects might be expected to interact more strongly.
It is possible to use a different construction involving separate
smooths for space and time, with the s()
splines, and to
avoid the assumption of the interaction of space and time effects:
# the fourth GAM
gam.4 <- gam(priceper~s(X,Y) + s(days), data = hp_data)
summary(gam.4)
#>
#> Family: gaussian
#> Link function: identity
#>
#> Formula:
#> priceper ~ s(X, Y) + s(days)
#>
#> Parametric coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 6832.48 42.45 161 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Approximate significance of smooth terms:
#> edf Ref.df F p-value
#> s(X,Y) 27.509 28.861 83.76 <2e-16 ***
#> s(days) 3.536 4.392 23.88 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> R-sq.(adj) = 0.578 Deviance explained = 58.5%
#> GCV = 3.4606e+06 Scale est. = 3.4019e+06 n = 1888
Again there is evidence of spatial and temporal trends in the data and these can be visualised using the plot.gam function in mgcv (called with just plot).
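A minimal call (placing both smooths on one page) might be:
# plot the spatial and temporal smooths
plot(gam.4, pages = 1)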
The mgcv plot of the spatial and temporal smooths.

Also, the space-time interactions could be generated in the same way
as for the gam.3
model above, using the X
and
Y
values in l_grid
and the time slices in
pred_days
.
Up until now only the response (target) variable, priceper, has been examined against time and location in GAM smooths specified in different ways, with the aim of exploring the spatial and temporal variability of the variable, but also to introduce GAM smooths. The code below includes the pef predictor variable in a GAM model as a fixed parametric term (i.e. in the same way as in the OLS model created above).
# the fifth GAM
gam.5 <- gam(priceper~s(X,Y) + s(days) + pef, data = hp_data)
summary(gam.5)
#>
#> Family: gaussian
#> Link function: identity
#>
#> Formula:
#> priceper ~ s(X, Y) + s(days) + pef
#>
#> Parametric coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 8826.648 408.776 21.593 < 2e-16 ***
#> pef -24.802 5.057 -4.905 1.02e-06 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Approximate significance of smooth terms:
#> edf Ref.df F p-value
#> s(X,Y) 27.507 28.861 77.85 <2e-16 ***
#> s(days) 3.473 4.316 24.89 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> R-sq.(adj) = 0.583 Deviance explained = 59%
#> GCV = 3.4216e+06 Scale est. = 3.3618e+06 n = 1888
The model summary indicates that pef is a significant predictor of priceper and improves the model fit. However, this raises a key question of how predictor variables should be included in the model: in parametric form or in a smooth? This is addressed in the next section.
As before the smooths can be plotted:
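For example, a minimal call:
# plot the smooths in gam.5
plot(gam.5, pages = 1)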
This section has introduced GAMs with splines / smooths as a way of investigating any space-time effects present in the data and their characteristics. The mechanics of splines were described and illustrated, and a brief introduction to different spline forms was provided. Variations in the response variable (priceper) in time, space and space-time were explored through elementary model fitting and plotting the smooth graphical trends. In this case, effects in space and time are present in the priceper response variable and it looks like there is a universal spatial trend at all times and a universal temporal trend everywhere. The final GAM model included an explanatory variable (pef) which added some explanatory power, but this was not examined with respect to space or time (i.e. in smooths). This is done in the next section.
Such investigations are an important initial step. They guide subsequent analysis rather than making assumptions about space time interactions and plugging every predictor variable into a space-time smooth of some kind. They provide evidence of space-time variations in the response variable - i.e. whether the effects in space and time are present in the data - and can help determine which predictor variables may be of further interest.
The previous section examined variability of the response variable in time, space and space-time (the latter in 2 different ways). In this section these investigations are extended to consider the variation of the response variable (priceper) with the pef predictor variable, but this time with the Intercept as an addressable term in the model.
Consider the gam.5
model created above:
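Its formula can be inspected, for example:
# inspect the model formula
gam.5$formula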
Note that the gam.5
model includes the terms
s(X,Y) + s(days)
. These model the spatially varying and
temporally varying Intercept plus error. Note that if this was a spatial
model (without time), then just s(X,Y)
would be included
for the Intercept and similarly just s(days)
for a purely
temporal model.
The gam.5 model can be reformulated as follows with an addressable Intercept term:
gam.5.new <- gam(priceper ~0 + Intercept + s(X,Y,by=Intercept) + s(days, by=Intercept) + pef,
data = hp_data |> mutate(Intercept = 1))
summary(gam.5.new)
#> 
#> Family: gaussian 
#> Link function: identity 
#> 
#> Formula:
#> priceper ~ 0 + Intercept + s(X, Y, by = Intercept) + s(days, 
#>     by = Intercept) + pef
#> 
#> Parametric coefficients:
#>            Estimate Std. Error t value Pr(>|t|)    
#> Intercept 8826.648    408.776  21.593  < 2e-16 ***
#> pef        -24.802      5.057  -4.905 1.02e-06 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Approximate significance of smooth terms:
#>                      edf Ref.df     F p-value    
#> s(X,Y):Intercept  27.507 28.861 77.85  <2e-16 ***
#> s(days):Intercept  3.473  4.316 24.89  <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> R-sq.(adj) = 0.583 Deviance explained = 59%
#> GCV = 3.4216e+06 Scale est. = 3.3618e+06 n = 1888
The model summary now includes Intercept
as a parametric
term (rather than (Intercept)
) and
s(X,Y):Intercept
and s(days):Intercept
as
smooth terms rather than s(X,Y)
and s(days)
,
but the values in both summaries are all the same.
The formulas of subsequent models in this vignette include the Intercept as an addressable term. The code below creates this variable in the input data (the original coordinates are retained in Xo and Yo to facilitate mapping of the resulting coefficient estimates):
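A minimal version of this step is:
# add an addressable Intercept term to the input data
hp_data <- hp_data |> mutate(Intercept = 1)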
The code below specifies a GAM regression model with smooths for space and time as in the previous section, but now with the pef predictor variable included both in parametric form and in a temporal smooth with days, allowing its relationship with the target variable to be modelled over time.
gam.t <- gam(priceper~0 + Intercept + s(X,Y,by=Intercept) + s(days, by=Intercept) +
pef + s(days, by = pef),
data = hp_data)
summary(gam.t)
#>
#> Family: gaussian
#> Link function: identity
#>
#> Formula:
#> priceper ~ 0 + Intercept + s(X, Y, by = Intercept) + s(days,
#> by = Intercept) + pef + s(days, by = pef)
#>
#> Parametric coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> Intercept 8738.828 410.218 21.303 < 2e-16 ***
#> pef -11.846 2.538 -4.668 3.27e-06 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Approximate significance of smooth terms:
#> edf Ref.df F p-value
#> s(X,Y):Intercept 27.526 28.864 78.064 < 2e-16 ***
#> s(days):Intercept 3.507 4.358 4.389 0.00119 **
#> s(days):pef 1.500 1.500 6.255 0.00294 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Rank: 49/50
#> R-sq.(adj) = 0.584 Deviance explained = 94%
#> GCV = 3.4159e+06 Scale est. = 3.3544e+06 n = 1888
The summary of this model indicates that the relationship of pef with priceper changes over time and has a strong linear negative trend (the smooth s(days):pef is significant, as is the pef parametric coefficient). The nature of the temporally varying coefficients can be examined using the calculate_vcs function in the stgam package. This returns a tibble with coefficient estimates ("b_") and standard errors ("se_") for each of the terms specified.
vcs <- calculate_vcs(input_data = hp_data, mgcv_model = gam.t, terms = c("Intercept", "pef"))
head(vcs)
#> # A tibble: 6 × 22
#> price priceper tfa dot days yot beds type cef pef ageb lad X Y Xo Yo Intercept
#> <dbl> <dbl> <dbl> <date> <dbl> <dbl> <int> <chr> <int> <int> <chr> <chr> <dbl> <dbl> <int> <int> <dbl>
#> 1 1000 7463. 134 2023-02-22 18.8 2023 9 T 60 82 1930-19… E090… 532. 189. 531674 188707 1
#> 2 520 7879. 66 2022-05-27 16.0 2022 3 T 57 74 1930-19… E090… 537. 189. 536888 189269 1
#> 3 325 5078. 64 2021-01-15 11.1 2021 4 T 50 82 1950-19… E090… 550. 191. 549763 190717 1
#> 4 300 2400 125 2020-01-17 7.44 2020 5 T 64 85 1900-19… E090… 539. 187. 538646 186513 1
#> 5 585 7222. 81 2020-11-13 10.4 2020 4 T 75 88 1983-19… E090… 515. 173. 514710 172535 1
#> 6 618. 6786. 91 2023-05-18 19.6 2023 4 T 61 84 1950-19… E090… 533. 170. 533233 170230 1
#> # ℹ 5 more variables: b_Intercept <dbl>, se_Intercept <dbl>, b_pef <dbl>, se_pef <dbl>, yhat <dbl[1d]>
The temporal variation of the relationship of pef with the target variable can be plotted, and in this case the near-linear trend is confirmed: the negative relationship of pef with the target variable changes steadily over time.
vcs |>
mutate(u = b_pef + se_pef,
l = b_pef - se_pef) |>
ggplot(aes(x = dot, y = b_pef, ymin = l, ymax = u)) +
geom_ribbon(fill = "lightblue") +
geom_line() +
theme_bw() +
xlab("Date") + ylab("pef")
The temporally varying coefficient estimates for pef.

The spatial variation in the relationship of the predictor variable with the target variable can also be examined. The code below specifies the pef predictor variable within a spatial smooth, but again with a spatially and temporally varying intercept. Notice how pef is included both in the smooth and as a parametric (global) term.
gam.s <- gam(priceper~0 + Intercept + s(X,Y,by=Intercept) + s(days, by=Intercept) +
pef + s(X,Y, by = pef),
data = hp_data)
summary(gam.s)
#>
#> Family: gaussian
#> Link function: identity
#>
#> Formula:
#> priceper ~ 0 + Intercept + s(X, Y, by = Intercept) + s(days,
#> by = Intercept) + pef + s(X, Y, by = pef)
#>
#> Parametric coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> Intercept 8229.6 442.6 18.593 <2e-16 ***
#> pef -111.8 168.1 -0.665 0.506
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Approximate significance of smooth terms:
#> edf Ref.df F p-value
#> s(X,Y):Intercept 29.000 29.000 5.742 < 2e-16 ***
#> s(days):Intercept 3.536 4.391 24.374 < 2e-16 ***
#> s(X,Y):pef 21.034 25.628 2.067 0.00122 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Rank: 69/70
#> R-sq.(adj) = 0.594 Deviance explained = 94.2%
#> GCV = 3.3672e+06 Scale est. = 3.269e+06 n = 1888
Here it is evident that pef varies significantly over space (see s(X,Y):pef) but is not significant as a global fixed term. That is, the coefficient estimate as a global slope is not significantly different from zero, but it is when allowed to vary over space, with the Intercept varying over space and time. This finding could be related to the age of the houses, which were built in clusters in different locations.
Again the varying coefficients can be extracted from the model for
the predictor variable and mapped both at observation locations and over
the hexagonal grid if a time period for the Intercept smooth in
gam.s
is specified:
# 1.over observation locations
vcs <- calculate_vcs(input_data = hp_data,
mgcv_model = gam.s,
terms = c("Intercept", "pef"))
tit <-expression(paste(""*beta[`pef`]*""))
p1 <-
ggplot() + geom_sf(data = lb, col = "lightgrey") +
geom_point(data = vcs, aes(x = X, y = Y, colour = b_pef), alpha = 1) +
scale_colour_continuous_c4a_div("brewer.rd_yl_bu", name = tit) +
theme_bw() +
theme(legend.position = "bottom",
legend.key.width = unit(1, "cm")) +
xlab("") + ylab("")
# 2. over grid - recall it needs an intercept term and a days value!
vcs <- calculate_vcs(input_data = l_grid |> mutate(Intercept = 1, days = mean(hp_data$days)),
mgcv_model = gam.s,
terms = c("Intercept", "pef"))
p2 <-
ggplot() +
geom_sf(data = vcs, aes(fill = b_pef), col = NA) +
scale_fill_continuous_c4a_div("brewer.rd_yl_bu", name = tit) +
theme_bw()+
theme(legend.position = "bottom",
legend.key.width = unit(1, "cm")) +
xlab("") + ylab("")
plot_grid(p1, p2)
The spatially varying coefficient estimates for pef over the original observations and the hexagonal grid.

Here the relationship of pef
, the Potential energy
efficiency rating, varies over space with a distinct inflection from
negative values in the centre of the study area to positive ones in
outer regions.
It is also possible to examine the space-time dependencies between
the target variable and the pef
predictor variable. The
code below specifies a GAM with a space-time smooth, with a spatially
and temporally varying intercept. As before, a single TP smooth is
specified with the dimensions of each basis (a 2D basis for
X
and Y
; a 1D basis for days
), as
different degrees of smoothness are expected across space and time.
gam.st1 <- gam(priceper~0 + Intercept + s(X,Y,by=Intercept) + s(days, by=Intercept) +
pef + t2(X,Y,days, d= c(2,1), by = pef),
data = hp_data)
summary(gam.st1)
#>
#> Family: gaussian
#> Link function: identity
#>
#> Formula:
#> priceper ~ 0 + Intercept + s(X, Y, by = Intercept) + s(days,
#> by = Intercept) + pef + t2(X, Y, days, d = c(2, 1), by = pef)
#>
#> Parametric coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> Intercept 8686.866 411.590 21.106 < 2e-16 ***
#> pef -10.827 2.262 -4.786 1.83e-06 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Approximate significance of smooth terms:
#> edf Ref.df F p-value
#> s(X,Y):Intercept 27.514 28.862 65.183 < 2e-16 ***
#> s(days):Intercept 3.519 4.372 4.383 0.00119 **
#> t2(X,Y,days):pef 5.555 5.556 5.862 1.63e-05 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Rank: 164/165
#> R-sq.(adj) = 0.584 Deviance explained = 94%
#> GCV = 3.4246e+06 Scale est. = 3.3556e+06 n = 1888
The pef variable is significant globally in this model, and it also varies significantly over space and time (t2(X,Y,days):pef). When the space-time trends are examined (see the figures below), the increasingly negative effect over time is evident, and the spatial distributions indicate a generally weaker relationship with the target variable in the east of the study area and an increasingly negative one to the west.
# calculate the varying coefficient estimates
vcs <- calculate_vcs(input_data = hp_data, mgcv_model = gam.st1, terms = c("Intercept", "pef"))
# temporal trends
p_time <-
vcs |>
select(dot, b_Intercept, b_pef) |>
pivot_longer(-dot) |>
mutate(name = recode(name,
"b_Intercept" = '""*beta[Intercept]',
"b_pef" = '""*beta[pef]')) |>
ggplot(aes(x = dot, y = value)) +
geom_point(alpha = 0.1) +
geom_smooth() +
facet_wrap(~name, labeller = label_parsed, scales = "free", ncol = 1) +
theme_bw() + xlab("Year") + ylab("")
# spatial trends
tit <-expression(paste(""*beta[`Intercept`]*""))
p_sp1 <-
ggplot() + geom_sf(data = lb, col = "lightgrey") +
geom_point(data = vcs, aes(x = X, y = Y, colour = b_Intercept), alpha = 1) +
scale_colour_continuous_c4a_seq("brewer.yl_gn_bu", name = tit) +
theme_bw() +
xlab("") + ylab("")
tit <-expression(paste(""*beta[`pef`]*""))
p_sp2 <-
ggplot() + geom_sf(data = lb, col = "lightgrey") +
geom_point(data = vcs, aes(x = X, y = Y, colour = b_pef), alpha = 1) +
scale_colour_continuous_c4a_div("brewer.rd_yl_bu", name = tit) +
theme_bw() +
xlab("") + ylab("")
plot_grid(p_time, plot_grid(p_sp1, p_sp2, ncol = 1), nrow = 1, rel_widths = c(3.5,6))
The space-time variation of the coefficient estimates for pef.

Of course it is also possible to plot time slices of the spatial distribution of the relationship of pef with priceper over grids, as was done before. Notice in the code below how the terms for the Intercept and pef are extracted for each time period in the looped call to calculate_vcs. The east-west trend and its changes over time are confirmed.
# 1. create time intervals (as above)
pred_days = seq(0, 2190, 365)/100
# 2. create coefficient estimates for each time period (n = 7)
res_out <- matrix(nrow = nrow(l_grid), ncol = 0)
for (i in pred_days){
res.i <- calculate_vcs(input_data = l_grid |> mutate(days = i),
mgcv_model = gam.st1,
terms = c("Intercept", "pef"))
# select just the coefficient estimates of interest
res.i <- res.i |> st_drop_geometry() |> select(starts_with("b_pef"))
res_out <- cbind(res_out, res.i)
}
# 3. name with years and join to the grid
colnames(res_out) <- paste0("PEF", "_", 2018:2024)
l_grid |> cbind(res_out) |>
# select the variables and pivot longer
select(starts_with("PEF")) |>
pivot_longer(-geometry) |>
# make the new days object a factor (to enforce plotting order)
mutate(name = factor(name, levels = colnames(res_out))) |>
# 4. and plot
ggplot() +
geom_sf(aes(fill = value), col = NA) +
# adjust default shading
scale_fill_continuous_c4a_div(name = "Potential Energy\nEfficiency") +
# facet
facet_wrap(~name, ncol = 3) +
# apply and modify plot theme
theme_bw() +
theme(
legend.position = "inside",
legend.direction = "horizontal",
legend.position.inside = c(0.7, 0.15),
legend.text=element_text(size=10),
legend.title=element_text(size=12),
strip.background = element_rect(fill="white", colour = "white"),
strip.text = element_text(size = 8, margin = margin(b=4)),
legend.key.width = unit(1.5, "cm"),
axis.title=element_blank(),
axis.text=element_blank(),
axis.ticks=element_blank())
The space-time variation of the pef coefficient estimate.

The previous section created a GAM model with a single space-time
smooth for the pef
predictor variable. It is also possible
to include them as separate smooths, as was done to investigate the
variability of the target variable. The code snippet below does this
again with a spatially and temporally varying intercept.
gam.st2 <- gam(priceper~0 + Intercept + s(X,Y,by=Intercept) + s(days, by=Intercept)
+ pef + s(X,Y, by = pef) + s(days, by = pef), data = hp_data)
summary(gam.st2)
#>
#> Family: gaussian
#> Link function: identity
#>
#> Formula:
#> priceper ~ 0 + Intercept + s(X, Y, by = Intercept) + s(days,
#> by = Intercept) + pef + s(X, Y, by = pef) + s(days, by = pef)
#>
#> Parametric coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> Intercept 8204.971 442.323 18.550 <2e-16 ***
#> pef -5.215 13.861 -0.376 0.707
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Approximate significance of smooth terms:
#> edf Ref.df F p-value
#> s(X,Y):Intercept 28.000 28.000 5.870 < 2e-16 ***
#> s(days):Intercept 3.579 4.443 3.896 0.00277 **
#> s(X,Y):pef 21.654 26.022 2.039 0.00156 **
#> s(days):pef 1.333 1.333 1.432 0.32924
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Rank: 77/80
#> R-sq.(adj) = 0.595 Deviance explained = 94.2%
#> GCV = 3.3618e+06 Scale est. = 3.2623e+06 n = 1888
Here the fixed global term is again not significant and, interestingly, neither is the separate temporal smooth. The plots of the smooths confirm this. Together this confirms what was found in the previous subsection with a combined TP smooth, and indicates that there is a spatial trend in the relationship of pef with priceper that changes in intensity over time but not in its spatial distribution. This suggests a super-imposition of spatial and temporal trends: the temporal smooth with days is not significant.
The mgcv plots of the separate space and time smooths for the pef predictor variable.

This section has applied smooths to a single predictor variable. These have explored time, space and space-time, by constructing smooths parameterised with time and location. These allow changes in the relationship between the predictor variable and the target variable to be examined over space and time. Importantly, a space-time GAM was constructed in 2 different ways: by including space and time in a single smooth, and by using separate space smooths and time smooths. These reflect different assumptions about the nature of the space-time dependencies in the data and result in different inferences (understanding) about the process if the significant components of the model are considered. This may not be important if the objective of constructing the GAM-based varying coefficient models is prediction, but it is if the objective is process understanding, as is commonly the case in geographical analyses of space-time data.
There are 2 important implications of this:
stgam: model selection
The models constructed in Section 3 used a variety of smooths. The Intercept was modelled as a parametric term with separate space and time smooths, and the pef predictor variable was included in parametric form alone, in parametric form with a single space-time smooth, and in parametric form with separate space and time smooths. This poses the question of which form to use and how to specify the model. Which is best? A key focus of the stgam package is to answer this question. It does this by creating and evaluating multiple models.
One way to evaluate models is to compare them using some metric. Commonly AIC (Akaike 1998), corrected AIC (AICc) (Hurvich and Tsai 1989) or BIC (Schwarz 1978) are used to compare models as they provide parsimonious measures of model fit (balancing model complexity and prediction accuracy). However, there are some uncertainties over using these to compare mgcv GAM smooth models, due to variations in the effective degrees of freedom (EDF) introduced by the smooths. The result is that model EDFs vary depending on how the smoothing parameters are selected, potentially leading to over-penalization or under-penalization in BIC calculations. As a result the model Generalized Cross-Validation (GCV) score is recommended for evaluating and comparing mgcv GAMs (S. N. Wood 2017; Marra and Wood 2011). GCV provides an un-biased risk estimator (UBRE) of model fit. It is similar to BIC in that it balances model fit with complexity, but it does not suffer from some of the problems of applying BIC to GAMs (e.g. bias in non-parametric model comparison, over-penalising complex smooths, etc.). The best model is the one that minimises the GCV score.
The code below extracts the GCV from each of the models in the last
section and prints them out in order. Here it can be seen that the GAM
model with pef
specified in separate space-time smooths
generated the best model.
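A minimal sketch of this step (assuming the fitted gam objects from the last section are still in the workspace; the GCV score is stored in each model's gcv.ubre component):
# collate and rank the GCV scores of the candidate models
data.frame(model = c("gam.t", "gam.s", "gam.st1", "gam.st2"),
           GCV = c(gam.t$gcv.ubre, gam.s$gcv.ubre,
                   gam.st1$gcv.ubre, gam.st2$gcv.ubre)) |>
  arrange(GCV)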
In a space-time model there are 6 options for each predictor variable:
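Based on the model forms used above and the labels in the model rankings below, these options are:
i. omitted from the model;
ii. included as a parametric (fixed) term only;
iii. included in a spatial smooth, s(X,Y);
iv. included in a temporal smooth, s(days);
v. included in separate spatial and temporal smooths, s(X,Y) + s(days);
vi. included in a single space-time smooth, t2(X,Y,days).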
The intercept can be treated similarly, but without it being absent (i.e. 5 options). Obviously there are 3 options for spatial models (i., ii. and iii.) and for temporal models (i., ii. and iv.), with each having 2 options for the intercept. Thus for a purely spatial or purely a temporal regression with \(k\) predictor variables there are \(2 \times 3^k\) potential models and for a space-time regression there are \(5 \times 6^k\) potential models to evaluate.
Recall that the hp_data
contains a number of numeric
variables that could be of use in explaining the space-time variations
in the priceper
target variable:
head(hp_data)
#> # A tibble: 6 × 17
#> price priceper tfa dot days yot beds type cef pef ageb lad X Y Xo Yo Intercept
#> <dbl> <dbl> <dbl> <date> <dbl> <dbl> <int> <chr> <int> <int> <chr> <chr> <dbl> <dbl> <int> <int> <dbl>
#> 1 1000 7463. 134 2023-02-22 18.8 2023 9 T 60 82 1930-19… E090… 532. 189. 531674 188707 1
#> 2 520 7879. 66 2022-05-27 16.0 2022 3 T 57 74 1930-19… E090… 537. 189. 536888 189269 1
#> 3 325 5078. 64 2021-01-15 11.1 2021 4 T 50 82 1950-19… E090… 550. 191. 549763 190717 1
#> 4 300 2400 125 2020-01-17 7.44 2020 5 T 64 85 1900-19… E090… 539. 187. 538646 186513 1
#> 5 585 7222. 81 2020-11-13 10.4 2020 4 T 75 88 1983-19… E090… 515. 173. 514710 172535 1
#> 6 618. 6786. 91 2023-05-18 19.6 2023 4 T 61 84 1950-19… E090… 533. 170. 533233 170230 1
These include cef (Current energy efficiency rating), pef (Potential energy efficiency rating) and beds (Number of bedrooms), as well as location (X and Y) and time (days). Thus with \(k = 3\) variables there are \(5 \times 6^3 = 1080\) potential models to evaluate in the space-time case study, and \(2 \times 3^3 = 54\) potential models in a spatial case study. Here 2 of the variables are used to illustrate an evaluation of \(5 \times 6^2 = 180\) space-time models: pef as before, and beds.
The evaluate_models() function in the stgam package constructs and compares the full set of potential models. It uses the GCV value of each model to evaluate and rank them. Note that in the code below, ncores is set to 2 to pass CRAN diagnostic checks - you may want to specify more using detectCores()-1 from the parallel package. Remember that the input data needs to have a defined Intercept term, which can be created in the following way: input_data |> mutate(Intercept = 1) (this was done at the start of Section 3 for hp_data).
library(doParallel)
t1 <- Sys.time()
stvc_mods <- evaluate_models(
input_data = hp_data,
target_var = "priceper",
vars = c("pef", "beds"),
coords_x = "X",
coords_y = "Y",
VC_type = "STVC",
time_var = "days",
ncores = 2)
Sys.time() - t1 # about 14 minutes (6 minutes with 15 cores!)
The best 10 models (i.e. those with the lowest GCV score) can be extracted using the gam_model_rank() function in the stgam package. Interestingly, the top 5 ranked models all have space and time specified for each predictor variable, but in different combinations of single and separate space-time smooths. The top 4 models all have the beds predictor variable specified in separate space and time smooths, suggesting that while there are space and time dependencies with the target variable, there are not space-time ones.
mod_comp <- gam_model_rank(stvc_mods)
# have a look
mod_comp |> select(-f)
#> Rank Intercept pef beds GCV
#> 1 1 t2_ST s_T + s_S s_T + s_S 16813.82
#> 2 2 s_T + s_S s_T + s_S s_T + s_S 16815.01
#> 3 3 t2_ST t2_ST s_T + s_S 16815.67
#> 4 4 s_T + s_S t2_ST s_T + s_S 16822.48
#> 5 5 s_T + s_S s_T + s_S t2_ST 16823.02
#> 6 6 s_S s_T + s_S s_T + s_S 16824.47
#> 7 7 t2_ST s_S s_T + s_S 16826.02
#> 8 8 t2_ST s_T + s_S t2_ST 16826.23
#> 9 9 t2_ST s_T + s_S s_S 16827.10
#> 10 10 s_T + s_S s_S s_T + s_S 16827.90
A final model for use in analysis can be specified in the form of the top ranked model above. This includes the Intercept in a single space-time TP smooth, and separate space and time smooths for pef and beds.
First the formula is extracted and can be inspected:
f <- as.formula(mod_comp$f[1])
f
#> priceper ~ Intercept - 1 + t2(X, Y, days, d = c(2, 1), by = Intercept) +
#> s(X, Y, by = pef) + s(days, by = pef) + s(X, Y, by = beds) +
#> s(days, by = beds)
Then this is put into an mgcv GAM model and checked for over-fitting using the k.check function in the mgcv package. This underpins the gam.check function but does not display the diagnostic plots. Here the k-index values are near to 1, the k' and edf parameters are not close, and importantly the edf values are all much less than the k' values, so this model is reasonably well tuned. NB if the k' and edf parameters are close for some smooths, then you may want to increase the k parameter in the relevant smooth. These are automatically determined by the mgcv package but can be specified manually - see the mgcv help for k.check and choose.k.
gam.m <- gam(f, data = hp_data, method = "REML")
k.check(gam.m)
#> k' edf k-index p-value
#> t2(X,Y,days):Intercept 124 30.729788 0.8165558 0.0000
#> s(X,Y):pef 30 3.055193 0.6432838 0.0000
#> s(days):pef 10 2.183809 1.0422023 0.9675
#> s(X,Y):beds 30 24.222744 0.6432838 0.0000
#> s(days):beds 10 2.936562 1.0422023 0.9550
A summary of the model can be examined and this shows that nearly all
of the terms are significant except the temporal smooth for
beds
.
summary(gam.m)
#>
#> Family: gaussian
#> Link function: identity
#>
#> Formula:
#> priceper ~ Intercept - 1 + t2(X, Y, days, d = c(2, 1), by = Intercept) +
#> s(X, Y, by = pef) + s(days, by = pef) + s(X, Y, by = beds) +
#> s(days, by = beds)
#>
#> Parametric coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> Intercept 10040.1 454.6 22.09 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Approximate significance of smooth terms:
#> edf Ref.df F p-value
#> t2(X,Y,days):Intercept 30.730 37.032 2.236 3.43e-05 ***
#> s(X,Y):pef 3.055 3.082 3.189 0.01350 *
#> s(days):pef 2.184 2.673 5.151 0.00223 **
#> s(X,Y):beds 24.223 26.353 5.647 < 2e-16 ***
#> s(days):beds 2.937 3.349 1.062 0.31011
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Rank: 203/205
#> R-sq.(adj) = 0.601 Deviance explained = 61.5%
#> -REML = 16814 Scale est. = 3.2129e+06 n = 1888
The spatially and temporally varying coefficient estimates generated by the model can be extracted in a number of different ways using the calculate_vcs function in the stgam package, as indicated in earlier sections. There are basically 2 options for generating the varying coefficient estimates: i) over the original data points or ii) over spatial surfaces for specific time slices.
For the first option the original data, the GAM model and a vector of the model terms (i.e. predictor variable names) are simply passed to the calculate_vcs function. The second option requires slices of the space-time data to be passed to the calculate_vcs function, and a deeper consideration of how the results are processed and combined. In both cases the results can be summarised and mapped.
Considering first the data of the original observations:
vcs <- calculate_vcs(input_data = hp_data,
mgcv_model = gam.m,
terms = c("Intercept", "pef", "beds"))
A summary over space and time of the coefficient estimates shows that the intercept is large and positive, and that pef and beds are generally negative but with some positive values in some places or at some times.
vcs |> select(starts_with("b_")) |>
apply(2, summary) |> round(1)
#> b_Intercept b_pef b_beds
#> Min. 4236.0 -60.0 -552.4
#> 1st Qu. 8303.6 -34.5 -369.7
#> Median 10187.8 -26.3 -260.3
#> Mean 10040.1 -26.1 -240.8
#> 3rd Qu. 11703.2 -17.6 -163.4
#> Max. 15778.7 7.7 388.0
These can be examined over time:
vcs |> select(dot, starts_with("b_")) |>
rename(`Intercept` = b_Intercept,
`Potential Energy Efficiency` = b_pef,
`Bedrooms` = b_beds) |>
pivot_longer(-dot) |>
mutate(name = factor(name, levels=c("Intercept","Potential Energy Efficiency", "Bedrooms"))) |>
group_by(dot, name) |>
summarise(
lower = quantile(value, 0.25),
median = median(value),
upper = quantile(value, 0.75)
) |>
ggplot(aes(x = dot, y = median)) +
geom_point(col = "blue", alpha = 0.2) +
geom_smooth() +
facet_wrap(~name, scales = "free_y") +
theme_bw() + xlab("") + ylab("") +
theme(strip.background = element_rect(fill="white"))
#> `summarise()` has grouped output by 'dot'. You can override using the `.groups` argument.
#> `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Standard dplyr
and ggplot
approaches can be
used to join and map the coefficient estimates as in the figure below,
which summarises the coefficient estimates for pef
(b_pef
) by year.
# make spatial data
vcs_sf <- vcs |> st_as_sf(coords = c("X", "Y"), remove = F)
# plot
ggplot() +
geom_sf(data = lb) +
geom_sf(data = vcs_sf, aes(col = b_pef)) +
scale_colour_continuous_c4a_div(palette="brewer.rd_bu",
name = "Potential\nEnergy Efficiency") +
facet_wrap(~yot) +
theme_bw() +
theme(
legend.position = "inside",
legend.direction = "horizontal",
legend.position.inside = c(0.7, 0.15),
legend.text=element_text(size=10),
legend.title=element_text(size=12),
strip.background = element_rect(fill="white", colour = "white"),
strip.text = element_text(size = 8, margin = margin(b=4)),
legend.key.width = unit(1.5, "cm"),
axis.title=element_blank(),
axis.text=element_blank(),
axis.ticks=element_blank())
The pef (Potential Energy Efficiency) coefficient estimates over space and time.

Finally, the l_grid can be used to summarise the coefficient
estimates over space and time as well, here summarising each year,
approximately described in the pred_days
object defined
above. Notice how the for
loop below combines all the
coefficient estimates this time.
The maps reflect the relatively small changes in the relationship
with priceper
over time plotted above.
# create time slices
years <- 2018:2024
# calculate over the grid for each time slice
res_out <- matrix(nrow = nrow(l_grid), ncol = 0)
for (i in 1:length(years)){
# convert years to days
day.val = (years[i]-2018) * 365 / 100
res.i <- calculate_vcs(input_data = l_grid |> mutate(days = day.val),
mgcv_model = gam.m,
terms = c("Intercept", "pef", "beds"))
# select all the coefficient estimates
res.i <-
res.i |>
st_drop_geometry() |>
select(starts_with("b_"),
starts_with("se_"))
# rename them
names(res.i) <- paste0(names(res.i), "_", years[i])
# bind to the result
res_out <- cbind(res_out, res.i)
cat(years[i], "\t")
}
#> 2018 2019 2020 2021 2022 2023 2024
# join to the grid
l_grid |> cbind(res_out) |>
# select the variables and pivot longer
select(starts_with("b_beds")) |>
pivot_longer(-geometry) |>
# make the new days object a factor (to enforce plotting order)
mutate(name = factor(name, levels = paste0("b_beds_", 2018:2024))) |>
# 4. and plot
ggplot() +
geom_sf(aes(fill = value), col = NA) +
# adjust default shading
scale_fill_continuous_c4a_div("brewer.rd_yl_bu", name = "Bedrooms") +
# facet
facet_wrap(~name, ncol = 3) +
# apply and modify plot theme
theme_bw() +
theme(
legend.position = "inside",
legend.direction = "horizontal",
legend.position.inside = c(0.7, 0.15),
legend.text=element_text(size=10),
legend.title=element_text(size=12),
strip.background = element_rect(fill="white", colour = "white"),
strip.text = element_text(size = 8, margin = margin(b=4)),
legend.key.width = unit(1.5, "cm"),
axis.title=element_blank(),
axis.text=element_blank(),
axis.ticks=element_blank())
The beds (Bedrooms) coefficient estimates over time and the grid surface.

This section has illustrated the use of functions in the stgam package. It suggests the following workflow:
1. Format the data.frame, tibble or sf object containing the data to have single location and time variables for each observation (in the above examples these were X, Y and days), and an Intercept as an addressable term.
2. Evaluate and rank multiple models using the model GCV value.

This was done algorithmically using the evaluate_models function. However, this was not undertaken in isolation. Rather, it built on the investigations in Section 3 to determine whether space and time effects were present in the data. This set of model investigations was undertaken to both develop and confirm knowledge of processes related to house prices in London: that is, the analysis was both considered and contextual. For example, in Section 3 it was determined that a varying intercept term was appropriate, and with a different dataset there may be a need to explore different avenues.
Similarly, more of the model space could have been examined, for example to include cef in the models. The stgam package allows these choices to be validated through an automated approach, providing an exploration of the full set of potential choices. The GCV, as an unbiased risk estimator, was useful in helping to evaluate the models.
The final space-time GAM model in this vignette was specified with method = "REML". Model estimation in GAMs can be conducted in two ways: (a) predictive (i.e., out-of-sample minimisation) via Generalized Cross-Validation (GCV), and (b) Bayesian (i.e. attaching priors to the basis coefficients) via restricted maximum likelihood (REML). REML tends to provide more stable estimates of the smoothing parameters (i.e., to reduce over-fitting) and more robust estimates of the variance. GCV is more computationally efficient but can over-fit, especially when errors are correlated. GCV is set as the default in mgcv.
In Section 4.3, the GAM model was checked for under- and over-fitting using the k.check function in the mgcv package; the advice was to increase k in the smooths if the k' and edf parameters were close. This can be done by specifying k manually in the smooth, as in the hypothetical example below.
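A hypothetical sketch (the k values here are purely illustrative, not recommendations):
# manually setting the basis dimension k in selected smooths
gam.k <- gam(priceper ~ 0 + Intercept + s(X, Y, by = Intercept, k = 60) +
               s(days, by = Intercept, k = 20) + pef,
             data = hp_data, method = "REML")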
The number of knots in the smooth (\(k\)) determines the complexity of the response-to-predictor variable smooths in the GAM. However, determining k is a balance: setting it too large can result in over-fitting, and setting it too low in under-fitting. Computational issues (speed and model stability) can also arise if k is set too large, especially in the space-time case for small datasets.
We think these talks by Noam Ross provide really useful tips and
pointers for working with GAMs and mgcv
:
The rationale for using GAMs with TP smooths and GP bases for spatially varying coefficient (SVC) or temporally varying coefficient (TVC) models is as follows:
The workflow suggested in this vignette and in the stgam package is to determine the most appropriate model form, with each step evaluated by minimising the Generalized Cross-Validation (GCV) score, an un-biased risk estimator recommended for evaluating mgcv GAMs (S. N. Wood 2017). This avoids making potentially unreasonable assumptions about how space and time interact in spatially and temporally varying coefficient (STVC) models. The best model can then be extracted and examined for the nature of the space-time interactions it suggests, before generating the varying coefficient estimates and mapping or plotting the results.
Future developments will seek to examine the scales of spatial dependencies in the data using kriging-based approaches, and will extend the stgam package to work with large data (i.e. to draw from the functionality of the gamm function in mgcv).