One of irtsim’s two stated goals is paper reproducibility: users should be able to run the Monte Carlo examples from Schroeders and Gnambs (2025), “Sample Size Planning for Item Response Models: A Tutorial for the Quantitative Researcher” (https://ulrich-schroeders.github.io/IRT-sample-size/), and compare irtsim output against the published reference.
This vignette is the honest scorecard for that goal as of the current release. It documents which of the paper’s three examples irtsim can reproduce end-to-end today, which it cannot, and what architectural changes would close the remaining gaps.
| Example | Paper scenario | irtsim status | Where to find it |
|---|---|---|---|
| Example 1 | 1PL linked two-form design, 30 items, 438 MC iterations, MSE criterion | Reproducible | `vignette("paper-example-1-linked-design")` |
| Example 2 | 2PL with bivariate θ + external criterion, MCAR, SE of cor(θ, criterion) via TAM latent-regression β | Not reproducible — architectural gaps | See “Example 2 gap” below |
| Example 3 | GRM with leave-one-measure-out, RMSE of `testinfo(mod, Theta = 2.0) / (1 + testinfo(mod, Theta = 2.0))` | Not reproducible — fitted-model access gap | See “Example 3 gap” below |
## Example 2 gap

The paper’s Example 2 constructs a bivariate latent trait (θ, external criterion ξ) with population correlation ρ = 0.5, generates responses to 30 2PL items, applies MCAR missingness at 0%/33%/67%, fits the model with `TAM::tam.mml()` (or `tam.mml.2pl()`) using the external criterion as a latent regressor, and extracts `SE(cor(θ, ξ)) = tam.se(mod)$beta[2, 2]`.
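For orientation, that pipeline can be sketched outside irtsim. This is a rough illustration, not irtsim or paper code: the sample size, item parameters, and the restriction to the 0%-missingness condition are all assumptions made here.

```r
# Sketch of the paper's Example 2 pipeline, outside irtsim.
# n, item parameters, and the 0% missingness case are illustrative choices.
library(MASS)
library(TAM)

set.seed(2024)
n <- 500; k <- 30
Sigma  <- matrix(c(1, 0.5, 0.5, 1), 2, 2)        # cor(theta, xi) = 0.5
latent <- MASS::mvrnorm(n, mu = c(0, 0), Sigma = Sigma)
theta  <- latent[, 1]
xi     <- latent[, 2]

a <- runif(k, 0.8, 2.0)                          # 2PL discriminations
b <- rnorm(k)                                    # 2PL difficulties
p <- plogis(sweep(outer(theta, b, "-"), 2, a, "*"))
resp <- matrix(rbinom(n * k, 1, p), n, k)

mod <- TAM::tam.mml.2pl(resp, Y = cbind(xi))     # xi as latent regressor
TAM::tam.se(mod)$beta[2, 2]                      # the paper's target SE
```

A full reproduction would repeat this per iteration and per MCAR condition before summarizing the SE across iterations.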
What irtsim is missing:

- `irt_design()` accepts a single `theta_dist` (a distribution name or a function producing a univariate θ vector). There is no way to declare a bivariate (θ, external criterion) generating distribution.
- `fit_model()` wraps `mirt::mirt()` directly and has no provision for passing an external criterion column through to a latent-regression estimator. mirt itself accepts covariates via `mirt(..., covdata = ...)`, but irtsim does not expose that surface.
- `criterion_fn` in `summary.irt_results()` receives only `(estimates, true_value, ci_lower, ci_upper, converged)` — all item-scoped. There is no way to hand the callback a fitted model, a θ estimate vector, or an external covariate vector, so even an ad hoc post-hoc reconstruction of the paper’s β SE is not expressible today.

Obj 30 in the project backlog tracks closing this gap.
## Example 3 gap

The paper’s Example 3 calibrates a 50-item GRM composed of three clinical symptom scales and computes, per iteration, `testinfo(mod, Theta = 2.0) / (testinfo(mod, Theta = 2.0) + 1)` — the conditional reliability at a target θ value. The criterion reported is the RMSE of that quantity across iterations.
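For a single iteration, the quantity itself is straightforward to compute with mirt. The sketch below uses an arbitrary 10-item GRM as a stand-in for the paper’s 50-item battery; every parameter value here is an assumption for illustration.

```r
# Sketch: conditional reliability at theta = 2.0 from a fitted GRM.
# Item count, parameters, and N are arbitrary stand-ins.
library(mirt)

set.seed(1)
a <- matrix(rlnorm(10, 0.2, 0.2))                 # slopes
d <- t(apply(matrix(rnorm(10 * 3), 10), 1,        # intercepts sorted
             sort, decreasing = TRUE))            # decreasing, as GRM requires
dat <- simdata(a, d, N = 1000, itemtype = "graded")
mod <- mirt(dat, 1, itemtype = "graded", verbose = FALSE)

info <- testinfo(mod, Theta = matrix(2.0))
info / (info + 1)                                 # conditional reliability
```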
What irtsim is missing:

- `run_one_iteration()` in `irt_simulate.R` calls `extract_params()` immediately after `fit_model()` returns and then discards the fitted mirt object. Nothing downstream has access to `mod` for test-info queries.
- There is no hook like `extract_fn(mod, data)` that would let a user pull arbitrary quantities out of the fitted model (test info at θ, reliability at θ, discriminant validity, etc.) and attach them to the per-iteration result store.
- As with `criterion_fn` in `summary.irt_results()`, the callback sees only item-level parameter estimates — not per-iteration scalar extracts that depend on the fitted model.

Obj 31 in the project backlog tracks closing this gap.
irtsim’s pipeline was originally scoped tightly around item parameter recovery — estimands like bias, MSE, RMSE, coverage, and empirical SE on `a`, `b`, `b1..bk`. That scope is well served by the current `(estimates, true_value, ci_lower, ci_upper)` interface. The paper’s Examples 2 and 3, however, target estimands that are not item-scoped parameters at all:

- Example 2 needs the SE of a latent-regression coefficient, `tam.se(mod)$beta[2, 2]`, which requires a fitted TAM model and a bivariate generating distribution.
- Example 3 needs the conditional reliability at θ = 2.0, which requires query access to the fitted mirt object.
Closing both gaps with a narrow, one-off hook for each estimand would entangle irtsim with TAM and `mirt::testinfo` and would keep adding scope for every new paper example that shows up.
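For contrast, the current item-scoped interface handles its intended criteria cleanly. A sketch of a Morris-style criterion written against the callback signature quoted above (the function name is illustrative; irtsim’s internal argument handling may differ):

```r
# An RMSE criterion over the current item-scoped callback signature.
# `estimates` holds one item parameter's estimates across iterations,
# `true_value` its generating value, `converged` a logical vector.
rmse_criterion <- function(estimates, true_value, ci_lower, ci_upper, converged) {
  est <- estimates[converged]            # drop non-converged iterations
  sqrt(mean((est - true_value)^2))       # RMSE against the true value
}
```

Nothing in this signature can reach a fitted model, which is exactly why Examples 2 and 3 fall outside it.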
The project is considering a single pluggable-hook addition that subsumes both gaps without hard-coding any specific backend:

```r
irt_simulate(
  study,
  iterations = 438,
  seed       = 2024,
  parallel   = TRUE,
  fit_fn     = my_fit_fn,    # user-supplied: fits a model their way
  extract_fn = my_extract_fn # user-supplied: returns a named list of
                             # per-iteration scalars / vectors
)
```

The returned `irt_results` object would gain a third store alongside `item_results` and `theta_results` — call it `extracted_results` — in which each named output of `extract_fn` becomes a column. Users would then write their own `summary`/`recommended_n` logic on that slot, or irtsim would provide a thin convenience that computes Morris-style criteria on each extracted column.
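Under that proposal, the Example 3 estimand would reduce to two short user functions. Both names and the `(mod, data)` signature are hypothetical — nothing below exists in irtsim today:

```r
# Hypothetical hooks for the proposed fit/extract contract.
my_fit_fn <- function(data) {
  mirt::mirt(data, 1, itemtype = "graded", verbose = FALSE)
}

my_extract_fn <- function(mod, data) {
  info <- mirt::testinfo(mod, Theta = matrix(2.0))
  list(cond_rel_at_2 = info / (info + 1))   # named per-iteration scalar
}
```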
This approach:

- gives Example 2 a route to `tam.se(mod)$beta[2, 2]`;
- gives Example 3 a route to `testinfo(mod, Theta = 2.0) / (1 + testinfo(...))`.

The tradeoff is that the fit/extract contract becomes a new public surface that has to be documented, versioned, and tested.
Even without the pluggable hook, the paper’s examples can be partially reproduced by using irtsim for the pieces it handles well and stitching in external code for the rest:

- use irtsim for the parts of the pipeline it already covers (design declaration, data generation, item-parameter recovery);
- hand-roll the rest: model fitting, fitted-model queries (e.g. `mirt::testinfo`), and collection of per-iteration scalar extracts.

This is a reasonable pattern while Obj 30 / Obj 31 are deferred. It also argues against inflating irtsim’s dependency surface.
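A minimal version of the hand-rolled half of that hybrid, assuming mirt for both generation and fitting (the replication count and item parameters are illustrative, and the summary here reports mean and SD rather than the paper’s RMSE, which would additionally need the true population value):

```r
# Hand-rolled Monte Carlo loop for the model-dependent extract that
# irtsim cannot yet collect. Parameters and reps are illustrative.
library(mirt)

set.seed(2024)
a <- matrix(rlnorm(10, 0.2, 0.2))
d <- t(apply(matrix(rnorm(10 * 3), 10), 1, sort, decreasing = TRUE))

rel <- replicate(20, {
  dat <- simdata(a, d, N = 500, itemtype = "graded")
  mod <- mirt(dat, 1, itemtype = "graded", verbose = FALSE)
  info <- testinfo(mod, Theta = matrix(2.0))
  info / (info + 1)                        # conditional reliability
})
c(mean = mean(rel), sd = sd(rel))          # summarize across iterations
```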
If your goal is a published-reference comparison for irtsim’s core
workflow — linked-test design, Rasch fitting, MSE of item difficulty
vs. sample size — Example 1 gives you that comparison as a rendered
vignette today. See
`vignette("paper-example-1-linked-design")` for the faithful
reproduction and a table of the design decisions mapped from the paper
onto irtsim’s API.
Schroeders, U., & Gnambs, T. (2025). Sample size planning for item response models: A tutorial for the quantitative researcher. Companion code: https://ulrich-schroeders.github.io/IRT-sample-size/.