---
title: "A one-predictor worked example"
subtitle: "Walking step by step through the ErrorTracer pipeline"
author: "Luis Javier Madrigal-Roca"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 3
    number_sections: true
vignette: >
  %\VignetteIndexEntry{A one-predictor worked example}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse  = TRUE,
  comment   = "#>",
  eval      = FALSE,   # Stan/brms chunks are too slow for automated builds
  fig.width = 7,
  fig.height = 4.5
)
```

# What this vignette is for

The methods paper for `ErrorTracer` contains a one-page **Box 1**
that walks through every stage of the pipeline using a single
predictor and a single response. The box gives the *intuition* in a
compressed form. This vignette is the long-form companion: same
example, same numbers, but with the code, the math, and the diagnostic
plots laid out one stage at a time so a reader new to Bayesian
modelling can follow along, run it, and modify it.

The whole example fits one model: phenology shift (days) as a
function of a single standardised temperature anomaly. We *know* the
data-generating coefficients because we simulate the data ourselves
— so you can see how well each stage recovers what was put in.

```{r}
library(ErrorTracer)
library(ggplot2)
library(patchwork)
```

# Setup: simulate one predictor, one response — over years

Shelf life is a *temporal* concept ("for how many years is my forecast
still useful?"), so the toy data is generated as a year-indexed time
series. The temperature predictor `x` drifts upward at a fixed warming
rate, the response `y` follows it through a linear coefficient
$\beta$, and the training window is the past while the forecast
window is the future.

```{r}
set.seed(20260523L)
n_train       <- 12L      # 12 training years
true_beta     <- 4.0      # days of phenology shift per SD of temperature
true_sigma    <- 1.5      # residual SD (days)
true_alpha    <- 0.0      # intercept
warming_slope <- 0.15     # SD of standardised temperature per year
year0         <- 2000L    # year at which the trend is referenced to zero

years_train <- year0 + seq_len(n_train) - 1L  # 2000..2011
train <- data.frame(
  year = years_train,
  x    = warming_slope * (years_train - year0) +
         rnorm(n_train, 0, 0.5)               # noisy observed temperature
)
train$y <- true_alpha + true_beta * train$x +
           rnorm(n_train, 0, true_sigma)
```

We have 12 noisy training observations along the warming trajectory.
The data-generating slope is $\beta = 4$ days per SD of temperature;
everything below should recover something close to this.

# Stage 1: turn a quick frequentist fit into a prior

`extract_priors()` is the bridge from a "fast, regularized" world to
a "slow, Bayesian" world. The recipe is the same for `lm`, `glm`,
`cv.glmnet`, and `ranger`: take the point estimate (or coefficient)
as the prior mean and scale the prior SD by the standard error
(or by the regularised coefficient magnitude, for `glmnet`).

```{r}
lm_fit  <- lm(y ~ x, data = train)
summary(lm_fit)$coefficients
priors  <- extract_priors(lm_fit, multiplier = 2.0, min_sd = 0.1)
priors
```

Concretely, with $m = 2$ (the `multiplier` default) and
$\sigma_{\min} = 0.1$, the prior on $\beta$ is

$$
\beta \sim \mathcal{N}\!\left(\hat\beta_{\text{lm}},
                              \bigl[\max(2 \cdot \mathrm{SE},\,
                                         \sigma_{\min})\bigr]^2\right).
$$

For our simulated data, $\hat\beta_{\text{lm}} \approx 4.02$ with
$\mathrm{SE} \approx 0.39$, so the prior is roughly
$\mathcal{N}(4.02, 0.79^2)$.

# Stage 2: Bayesian refit with the informed prior

```{r}
bfit <- et_fit(
  formula = y ~ x,
  data    = train,
  priors  = priors,
  chains  = 4L, iter = 2000L, warmup = 1000L,
  cores   = 4L, seed = 1L, silent = 2L, refresh = 0
)
print(bfit)
```

Under the hood `et_fit()` calls `brms::brm()` with the prior we
supplied for $\beta$ plus weakly-informative defaults for the
intercept and the residual scale. The output is a full posterior:

```{r}
beta_draws <- as.matrix(bfit$fit, variable = "b_x")[, 1]
c(mean = mean(beta_draws), sd = sd(beta_draws))
```

For our seed, the posterior collapses to $\beta \approx 4.01$ with SD
$\approx 0.38$. The data didn't drag the prior very far because the
prior was *already* centred on the lm estimate — but the posterior
SD is sharper than the prior SD, which is the regular Bayesian
shrinkage we wanted.

A picture is worth a thousand draws. The package ships a helper:

```{r}
et_plot_prior_posterior(bfit)
```

# Stage 3: posterior prediction across the forecast horizon

The forecast window is years 2012–2045. At each future year, the
projected temperature is the trend-line value
$x(t) = \mathrm{warming\_slope}\cdot(t - 2000)$. We pass
`env_noise = list(x = 0.3)` to tell `et_predict()` that the projected
temperature is itself uncertain by $\pm 0.3$ SD (a reasonable error
for a downscaled climate projection).

```{r}
year_min     <- year0
year_max     <- year0 + 45L
years_grid   <- seq(year_min, year_max, by = 1L)
forecast_df  <- data.frame(
  year = years_grid,
  x    = warming_slope * (years_grid - year0)
)

pred <- et_predict(
  model             = bfit,
  newdata           = forecast_df,
  env_noise         = list(x = 0.3),
  n_draws           = 2000L,
  n_perturb         = 500L,
  ci_levels         = 0.90,
  include_env_in_ci = TRUE
)
```

Setting `include_env_in_ci = TRUE` folds the environmental noise into
the reported credible intervals. With it `FALSE` (the default), the CI
captures parameter + residual uncertainty only; environmental
variance is still reported separately in `decompose_uncertainty()`,
but the user-facing CI is "what would the response be if we knew the
future temperature exactly". The choice depends on whether you're
forecasting *under* projection uncertainty or *given* a known
driver.

# Stage 4: decompose the variance over time

`decompose_uncertainty()` returns one row per forecast year with the
three (or four, if you have autocorrelation) variance components:

```{r}
decomp <- decompose_uncertainty(pred)
decomp$year <- forecast_df$year
head(decomp)
```

Three things to notice as the forecast horizon extends into the
future (year increases, so $x$ on the warming trajectory increases
too):

1. `param_var` grows roughly as $\hat\sigma_\beta^2 \cdot x(t)^2$.
   That's the *extrapolation cost* of being uncertain about $\beta$:
   the further past the training window you forecast, the more it
   matters.
2. `env_var` is roughly constant at $\hat\beta^2 \cdot \sigma_x^2$.
   It scales with how steep the dose–response is *and* how poorly
   we know the future dose, but it does not grow with horizon.
3. `residual_var` is the biological-noise floor. It does not change
   with the forecast year because it's a property of the response,
   not of when you forecast.

A quick numerical look at a representative mid-forecast year:

```{r}
mid <- which(decomp$year == 2020L)
decomp[mid, ]
```

# Stage 5: when does the forecast become uninformative?

A forecast is *informative* when the predictive CI is narrower than
the plausible range of the response. The plausible range here is "any
phenology shift between $-4$ and $+4$ days" — anything wider than
that and we are basically saying "we have no idea".

```{r}
sl <- shelf_life(
  predictions    = pred,
  response_scale = c(-4, 4),
  ci_level       = 0.90,
  threshold      = 1.0,
  time_col       = "year"
)
attr(sl, "horizon")
```

`shelf_life()` walks along the forecast time axis and reports the
first *year* at which the 90 % CI width first exceeds the response
range. For this example it lands at $t^{\star} \approx 2027$ — i.e.
the model still produces a useful forecast for roughly 16 years past
the end of training; beyond that, the credible interval is wider
than the plausible response and should not be over-interpreted.
That answer — a calendar year, not a predictor value — is the
quantity ecologists and conservation practitioners actually need.

# The whole-pipeline figure

The four diagnostic panels — prior vs posterior, the nested CI fan,
the stacked variance components, and the shelf-life curve — are
produced together by the script that generates Box 1 of the paper.
The exact same code is what appears in the box's figure; see the
package source under `src/box1_worked_example.R` in the
proof-of-concept repository (or run the chunk below in your own
session):

```{r, eval = FALSE}
# Reproduce the four-panel Box 1 figure used in the methods paper.
source(system.file("extdata", "box1_worked_example.R",
                   package = "ErrorTracer"))
run_box1_worked_example(figures_dir = tempdir(),
                        tex_path    = tempfile(fileext = ".tex"))
```

# What this example does *not* show

Two limits of this minimal walk-through deserve flagging:

- **Small $n$, single predictor.** With 12 observations and one
  driver, the Bayesian fit is doing real work — informative priors
  matter. Real workflows usually have larger $n$, multiple correlated
  predictors, and a regularised feature-selection step (elastic net)
  upstream of `extract_priors()`. The fundamental decomposition is
  the same.
- **No grouping.** `et_fit(..., grouping = "g")` runs the same
  pipeline independently per level of a grouping factor (per SNP
  cluster, per population, per site). The toy example fits a single
  global model; the multi-group form returns a named list of
  `et_model` objects, and `decompose_uncertainty()` / `shelf_life()`
  preserve the grouping.

# Session info

```{r, eval = TRUE}
sessionInfo()
```