GLAMMGoF is an R package for resampling-based predictive
assessment of generalized linear and generalized additive models. The
two main functions in the package, bias_precision() and
brier_auc(), quantify in-sample and out-of-sample
predictive performance by repeatedly refitting a model structure to
re-sampled training data and evaluating predictions on held-out testing
data. These functions support model objects from a variety of R
packages, including glmmTMB() from glmmTMB,
lmer(), glmer(), and glmer.nb()
from lme4, glm.nb() from MASS,
gam() from mgcv, and glm() and
lm() from the stats package. Optional residual diagnostics
can also be performed using the DHARMa package.
bias_precision() is intended for models with
continuous or integer response variables (e.g., Gaussian, Poisson,
negative binomial, Tweedie, or Gamma distributions) and returns relative
root mean squared error (RRMSE), relative mean absolute error (RMAE),
relative median absolute error (RMedAE), and relative bias
(RBIAS).
brier_auc() is intended for models with binary
response variables (i.e., logistic regression) and returns AUC, Brier
score, and log loss.
Both functions accept models with or without random effects. By
default, predictive performance is based on population-level (marginal)
predictions, meaning random effects are excluded during prediction
(re.form = ~0). This behavior can be changed via the
conditional_predictions argument (see the Jensen’s
inequality section below for details).
Two resampling methods are available via the method argument: repeated random holdout (method = “holdout”, the default), which repeatedly partitions the data into training and testing subsets and is intuitive and widely used; and bootstrap resampling with out-of-bag evaluation (method = “bootstrap”), which samples with replacement and tends to make more efficient use of the available data, particularly for smaller data sets.
Important note: The functions in this package take an existing model fitted to the full data set and assess the predictive performance of that model structure through repeated resampling. In each replicate, the model is refit to a re-sampled training data set and evaluated on separate testing data. This workflow is intentional: fitting the final model to all available data maximizes information available for inference, while resampling-based predictive assessment provides a practical way to evaluate generalization without permanently withholding data. This approach is especially useful in ecological applications where sample sizes are often limited.
In general, method = "holdout" more closely reflects
prediction to independent unseen observations and may be more intuitive
for assessing generalization, whereas method = "bootstrap"
makes more efficient use of the available data by evaluating performance
on out-of-bag observations and may provide more stable estimates of
predictive performance when sample sizes are small.
You’ve either downloaded and installed GLAMMGoF from
R-universe, GitHub, DevOps, or a local zipped package file. Once
installed, you can access the vignette by running
browseVignettes("GLAMMGoF"). Note that ggplot2
is a dependency, and functions from numerous other R packages are
imported and used as well.
There are only two functions in this package, and which one you use depends on the type of model being assessed:
bias_precision(): used for models with continuous or
integer response variables and estimates four predictive performance
metrics using repeated resampling: RRMSE,
RMAE, RMedAE, and RBIAS. For all
four metrics, values closer to 0 indicate better predictive performance
(see ?bias_precision for details). All three accuracy
statistics (RRMSE, RMAE, and
RMedAE) express prediction error as a percentage of the
mean observed value, which places them on a common scale regardless of
the units or magnitude of the response. To recover the corresponding raw
error in the original units of the response, multiply any of these
values by the observed mean and divide by 100. For example, an
RRMSE of 18% with a sample mean of 50 implies a root mean
squared error of 9 in the original units.
RRMSE (Relative Root Mean Squared Error): The square
root of average squared prediction errors, expressed as a percentage of
the mean observed value. Because errors are squared before averaging,
RRMSE penalizes large individual errors more heavily than small ones,
making it sensitive to cases where the model produces occasional extreme
predictions.
fit_cost_rrmse <- function(y, yhat) sqrt(mean((yhat - y)^2)) / mean(y) * 100RMAE (Relative Mean Absolute Error): The average
absolute prediction error expressed as a percentage of the mean observed
value. RMAE treats all errors proportionally to their size
without the extra weight that squaring applies, and is often the most
intuitive summary of typical prediction accuracy.
fit_cost_rmae <- function(y, yhat) mean(abs(yhat - y)) / mean(y) * 100RMedAE (Relative Median Absolute Error): The median
absolute prediction error expressed as a percentage of the mean observed
value. Because it is based on the median rather than the mean of the
error distribution, RMedAE is the most robust of the three
to outlying predictions and gives the best picture of accuracy for a
typical observation when the error distribution is skewed.
fit_cost_rmedae <- function(y, yhat) median(abs(yhat - y)) / mean(y) * 100
RRMSE is notably larger than
RMAE, that signals a handful of high-leverage predictions
are inflating the squared-error average, which is a pattern
RMedAE will often fail to reflect if the bulk of
predictions are accurate. Conversely, close agreement among all three
suggests errors are roughly symmetric and there are no extreme outliers
driving the summary.RBIAS (Relative Bias): The mean signed prediction
error expressed as a percentage of the mean observed value, where
positive values indicate systematic over-prediction and negative values
indicate systematic under-prediction. RBIAS is independent
of the accuracy metrics above: a model can be nearly unbiased on average
yet still produce large errors, or it can be highly biased while still
ranking observations correctly. Reporting RBIAS alongside
RRMSE and RMAE therefore distinguishes random
prediction noise from systematic directional error.
fit_cost_rbias <- function(y, yhat) mean(yhat - y) / mean(y) * 100brier_auc(): used for models with binary response
variables and estimates three predictive performance metrics using
repeated resampling: AUC, Brier score, and
log loss. AUC values closer to 1 (bounded
between 0 and 1) indicate better classification performance, whereas
Brier score (bounded between 0 and 1) and
log loss (lower bound of 0, no upper bound) values closer
to 0 indicate better probabilistic predictions (see
?brier_auc for details).
The three binary metrics each capture a different aspect of predictive performance and are most informative when interpreted together.
AUC (Area Under the ROC Curve): The probability that
a randomly chosen positive case receives a higher predicted probability
than a randomly chosen negative case, ranging from 0 (no discrimination)
to 1 (perfect discrimination), and where an AUC of 0.5
indicates that a model predicts no better than random chance (and
AUC <0.5 is worse than random). AUC is
purely a measure of discrimination (the model’s ability to rank cases by
their probability of the outcome) and is insensitive to how well the
predicted probabilities are calibrated in absolute terms. This function
also returns the null model (i.e., intercept-only) AUC
statistic, which is assumed to be 0.5, which allows for comparison to
the summarized AUC metrics.
Brier Score: Across all observations, the mean
squared difference between predicted probabilities and observed binary
outcomes (0/1), ranging from 0 (perfect) to 1 (worst). The
Brier score jointly rewards good discrimination and good
calibration, penalizing confident but wrong predictions more heavily
than uncertain ones. A model can have high AUC but a poor
Brier score if its predicted probabilities are
systematically too high or too low even while correctly ordering cases.
This function also returns the null model (i.e., intercept-only)
Brier score for reference (i.e., the best predictive
performance achievable without any predictors), computed as
p_bar * (1 - p_bar), where p_bar is the
observed event rate, which allows for comparison to the summarized Brier
score metrics. Note that the null Brier score is based on the arithmetic
mean of the response, which approximates but is not identical to
plogis(intercept) from a fitted intercept-only logistic model; the
difference is negligible in practice and avoids the overhead of model
fitting.
Log Loss (Binary Cross-Entropy): Across all
observations, the mean negative log-likelihood of the observed outcomes
under the predicted probabilities, where lower values indicate better
performance. Like the Brier score, log loss
rewards calibration, but its logarithmic scale means it penalizes
extreme, confidently wrong predictions far more aggressively.
Log loss is therefore the most sensitive of the three to
cases where the model assigns near-zero probability to an outcome that
actually occurs. This function also returns the null model (i.e.,
intercept-only) log loss for reference (i.e., the best predictive
performance achievable without any predictors), computed as
-(p_bar * log(p_bar) + (1 - p_bar) * log(1 - p_bar)), where
p_bar is the observed event rate, which allows for
comparison to the summarized log loss metrics. Note that the null log
loss is based on the arithmetic mean of the response, which approximates
but is not identical to plogis(intercept) from a fitted intercept-only
logistic model; the difference is negligible in practice and avoids the
overhead of model fitting. Log loss is computed internally as:
log_loss <- function(y, p, eps = 1e-15) {p <- pmax(pmin(p, 1 - eps), eps) -mean(y * log(p) + (1 - y) * log(1 - p))}
AUC
reflects whether the model correctly separates cases from non-cases;
Brier score provides an overall summary of probability
accuracy on a squared-error scale; and log loss highlights
whether the model is making any dangerously overconfident errors that
the Brier score might underweight.Both functions assess in- and out-of-sample predictive performance
for generalized linear and generalized additive models, with or without
random effects, by repeatedly resampling data and evaluating predictive
performance across multiple replicates. Results are stored across all
replicates and reported as a summary table (mean and 95% confidence
intervals) and a histogram of replicate values for each performance
statistic, with a blue dotted vertical line indicating the mean across
replicates. For brier_auc(), a red dashed vertical line is
also displayed in the log loss panels, indicating the null
(intercept-only) log loss baseline as a reference point for model
comparison.
Two resampling methods are available via the method argument. The
default (method = "holdout") uses repeated random holdout
(Monte Carlo cross-validation), which splits the data into a training
set (80% by default, controlled by propTrain) and a test
set (the remaining 20%). The model is refit to the training data and
performance is evaluated on both the training data (in-sample
performance) and the withheld test data (out-of-sample performance).
This method is intuitive and widely understood, but the fixed
training/test split means the training set is always smaller than the
full data set, which can be a limitation for smaller data sets.
Alternatively, method = "bootstrap" uses bootstrap
resampling with replacement, where each bootstrap training data set
contains the same number of observations as the original data set,
sampled with replacement. Out-of-sample performance is evaluated on the
out-of-bag observations not selected in the bootstrap sample
(approximately 36.8% of observations on average). This method makes more
efficient use of the available data and tends to produce more stable
estimates, particularly for smaller data sets, but is slightly less
intuitive than the holdout approach.
For well-behaved models and reasonably sized data sets, both methods should produce similar results. Differences are most likely to emerge with small data sets, highly overdispersed data, or poorly specified models.
For zero-heavy count or continuous data, the choice of performance
metric matters. RBIAS in particular can behave unexpectedly
when the response contains zeros or values near zero, because the
relative error formulation divides by the observed value, producing
unstable or extreme terms that can dominate the average and give a
misleading impression of systematic underprediction. This is a known
limitation of relative bias metrics rather than evidence of poor model
fit (Hyndman and Koehler 2006), and is most likely to occur when zeros
are common but the zero-generating process is not explicitly
modeled.
In these situations, RMedAE or RMAE may
provide more reliable summaries of predictive performance.
RMedAE is robust to extreme errors because it is based on
the median absolute error rather than the mean, and therefore better
reflects predictive accuracy for a typical observation.
RMAE can also be useful because absolute-error metrics
weight errors linearly rather than quadratically, making them less
sensitive to extreme observations than RMSE-based
measures.
When combined with the DHARMa zero-inflation test
(testZI = TRUE), these metrics provide a more complete
assessment than any single measure alone. RMedAE or
RMAE summarize predictive accuracy, while the ZI test
evaluates whether the model correctly reproduces the observed frequency
of zeros. A significant ZI test result with
ratioObsSim > 1 indicates that the model is
underpredicting zeros, suggesting that a zero-inflated model family or
additional sources of heterogeneity should be considered.
This entire process is repeated nReps times, and the
resulting sampling distributions of the performance metrics are
summarized. Because RRMSE, RMAE,
RMedAE, RBIAS, AUC,
Brier score, and log loss quantify different
aspects of predictive performance, they should be interpreted as
complementary rather than interchangeable
There are two example data sets, countData and
logitData, with integer and binary response variables,
respectively. Each data set also includes a variety of continuous and
categorical predictor variables. Additionally, there are 12 example
model objects (6 beginning with count and 6 beginning with
logit): Model_GLM, Model_GLMM,
Model_GLMM2, Model_GAM,
Model_GAMM, and Model_GAMM2. Models ending in
M include only fixed effects, whereas models ending in
MM and MM2 include one and two random effects,
respectively. All 12 example models were fitted using
glmmTMB, but GLAMMGoF supports models fitted
using a variety of compatible R packages
bias_precision() functionTo use this function properly, you must:
Be sure you’re using it with the correct model type, which in
this case is one with a continuous or integer response variable. Here, a
variety of distributions (e.g., Gaussian, Poisson, negative binomial,
Tweedie, and gamma) are supported, as are numerous R packages
(glmmTMB, lme4 (including glmer,
glmer.nb, and lmer), MASS
(glm.nb), mgcv (gam), and
glm/lm from the stats
package).
The only other choices you need to make are:
The number of bootstrap or Monte Carlo replicates you want to use
(the default is nReps=100 but this should be at least 500
or 1000).
For method = "holdout" only: The proportion of data
you want to use for assessing in-sample predictive performance (the
default is 0.8).
Whether you want to use the DHARMa package’s
simulateResiduals() function to assess and plot the model’s
residuals (the default is DHARMaPlot=TRUE), whether you
also want DHARMa's to test for zero inflation using the
testZeroInflation() function (the default is
testZI=TRUE), and how many simulation replicates you want
DHARMa to use (the default is
DHARMaReps=1000).
Now let’s run it and see what happens…
bp <- bias_precision(nReps = 100, testModel = countModel_GLMM, testData = countData, propTrain = 0.8, DHARMaPlot = TRUE, testZI = TRUE, DHARMaReps = 1000, seed = 123, method = "holdout")Look at the first few fit statistics…
| simRep | Group | Metric | value |
|---|---|---|---|
| 1 | In-sample performance | RRMSE | 105.96 |
| 1 | Out-of-sample performance | RRMSE | 98.81 |
| 1 | In-sample performance | RMedAE | 47.52 |
| 1 | Out-of-sample performance | RMedAE | 48.30 |
| 1 | In-sample performance | RMAE | 69.30 |
| 1 | Out-of-sample performance | RMAE | 68.08 |
| 1 | In-sample performance | RBIAS | -4.76 |
| 1 | Out-of-sample performance | RBIAS | -0.79 |
And then look at a summary of the nReps values…
| Group | Metric | mn | lwr95 | upr95 |
|---|---|---|---|---|
| In-sample performance | RRMSE | 104.83 | 101.13 | 107.46 |
| In-sample performance | RMAE | 68.92 | 67.68 | 70.10 |
| In-sample performance | RMedAE | 47.42 | 45.76 | 48.98 |
| In-sample performance | RBIAS | -4.77 | -6.16 | -3.22 |
| Out-of-sample performance | RRMSE | 104.25 | 93.90 | 118.32 |
| Out-of-sample performance | RMAE | 69.17 | 64.06 | 76.01 |
| Out-of-sample performance | RMedAE | 47.44 | 41.29 | 54.63 |
| Out-of-sample performance | RBIAS | -4.27 | -16.88 | 11.77 |
And now look at a plot of the summaries
Now let’s compare to the bootstrap method, which
generally should be very similar for well-behaved models and reasonably
sized data sets.
bp <- bias_precision(nReps = 100, testModel = countModel_GLMM, testData = countData, propTrain = 0.8, DHARMaPlot = TRUE, testZI = TRUE, DHARMaReps = 1000, seed = 123, method = "bootstrap")Summary of bootstrap method
| Group | Metric | mn | lwr95 | upr95 |
|---|---|---|---|---|
| In-sample performance | RRMSE | 104.68 | 96.95 | 112.63 |
| In-sample performance | RMAE | 68.68 | 65.40 | 72.17 |
| In-sample performance | RMedAE | 47.24 | 44.61 | 50.68 |
| In-sample performance | RBIAS | -5.13 | -7.97 | -2.41 |
| Out-of-sample performance | RRMSE | 104.66 | 93.40 | 113.78 |
| Out-of-sample performance | RMAE | 69.33 | 65.18 | 73.82 |
| Out-of-sample performance | RMedAE | 47.64 | 41.25 | 53.65 |
| Out-of-sample performance | RBIAS | -4.52 | -18.39 | 6.43 |
And now look at a plot of the summaries
brier_auc() functionTo use this function properly, you must:
Be sure you’re using it with the correct model type, which in
this case is only a logistic regression with a binary response variable.
Additionally, logistic regression models fit using the R packages
glmmTMB, lme4, and
glm/lm from the stats package are
all supported.
The only other choices you need to make are:
The number of bootstrap or Monte Carlo replicates you want to use
(the default nReps = 100 but this should be at least 500 or
1000).
For method = "holdout" only: The proportion of data
you want to use for assessing in-sample predictive performance (the
default is propTrain = 0.8).
Whether you want to use the DHARMa package’s
simulateResiduals() function to assess and plot the model’s
residuals (the default is DHARMaPlot=TRUE) and how many
simulation replicates you want DHARMa to use (the default
is DHARMaReps=1000). There is no option to test for
zero-inflation with this function as this is not strictly relevant to
logistic regression models with binary responses.
br_auc <- brier_auc(nReps = 100, testModel = logitModel_GLMM, testData = logitData, propTrain = 0.8, DHARMaPlot = TRUE, DHARMaReps = 1000, seed = 123, method = "holdout")Look at the first few fit statistics…
| simRep | Group | Metric | value |
|---|---|---|---|
| 1 | In-sample performance | AUC statistic | 0.76 |
| 1 | In-sample performance | Brier score | 0.15 |
| 1 | In-sample performance | Log loss | 0.46 |
| 1 | Out-of-sample performance | AUC statistic | 0.70 |
| 1 | Out-of-sample performance | Brier score | 0.14 |
| 1 | Out-of-sample performance | Log loss | 0.43 |
| 2 | In-sample performance | AUC statistic | 0.77 |
| 2 | In-sample performance | Brier score | 0.14 |
| 2 | In-sample performance | Log loss | 0.45 |
And then look at a summary of the nReps values…
| Group | Metric | mn | lwr95 | upr95 |
|---|---|---|---|---|
| In-sample performance | AUC statistic | 0.75 | 0.74 | 0.77 |
| In-sample performance | Brier score | 0.14 | 0.14 | 0.15 |
| In-sample performance | Log loss | 0.45 | 0.43 | 0.46 |
| Out-of-sample performance | AUC statistic | 0.74 | 0.66 | 0.80 |
| Out-of-sample performance | Brier score | 0.15 | 0.12 | 0.18 |
| Out-of-sample performance | Log loss | 0.46 | 0.40 | 0.56 |
| Null model (baseline) | AUC statistic | 0.50 | NA | NA |
| Null model (baseline) | Brier score | 0.17 | NA | NA |
| Null model (baseline) | Log loss | 0.52 | NA | NA |
And now look at a plot of the summaries
Let’s again compare to the bootstrap method, which
generally should be very similar for well-behaved models and reasonably
sized data sets.
br_auc <- brier_auc(nReps = 100, testModel = logitModel_GLMM, testData = logitData, propTrain = 0.8, DHARMaPlot = TRUE, DHARMaReps = 1000, seed = 123, method = "bootstrap")Summary of bootstrap method
| Group | Metric | mn | lwr95 | upr95 |
|---|---|---|---|---|
| In-sample performance | AUC statistic | 0.75 | 0.72 | 0.79 |
| In-sample performance | Brier score | 0.14 | 0.13 | 0.16 |
| In-sample performance | Log loss | 0.45 | 0.42 | 0.49 |
| Out-of-sample performance | AUC statistic | 0.74 | 0.68 | 0.79 |
| Out-of-sample performance | Brier score | 0.15 | 0.13 | 0.17 |
| Out-of-sample performance | Log loss | 0.46 | 0.41 | 0.52 |
| Null model (baseline) | AUC statistic | 0.50 | NA | NA |
| Null model (baseline) | Brier score | 0.17 | NA | NA |
| Null model (baseline) | Log loss | 0.52 | NA | NA |
And now look at a plot of the summaries
When a GLMM includes random effects and a nonlinear link function (e.g., log or logit), back-transforming population-level (marginal) predictions to the response scale introduces a systematic bias. This occurs because the exponential function curves upward – averaging on the log scale and then back-transforming always gives a smaller value than back-transforming individual predictions and then averaging (e.g., a bowl-shaped or exponential growth curve). This property is known as Jensen’s inequality. For a log-link model with random effects, this means:
\[E[\exp(\beta_0 + u)] = \exp(\beta_0) \cdot \exp(\sigma^2/2)\]
where \(\sigma^2\) is the total random effect variance. In other words, \(\exp(\beta_0)\) , the typical marginal prediction, underestimates the true arithmetic mean by a factor of \(\exp(\sigma^2/2)\).
This bias is directly analogous to the well-known log-normal back-transformation bias. The math is identical; the only difference is where the log-scale variance comes from:
| Log-normal model | Log-link GLMM | |
|---|---|---|
| Model structure | \(\log(Y) \sim \text{Normal}(\mu, \sigma^2)\) | \(\log(\mu_{ij}) = X\beta + u_j\), \(u_j \sim \text{Normal}(0, \sigma^2)\) |
| Log-scale variance | Residual \(\sigma^2\) | Random effect variance \(\sigma^2\) |
| Naive back-transform | \(\exp(\mu)\) | \(\exp(X\beta)\) |
| True arithmetic mean | \(\exp(\mu + \sigma^2/2)\) | \(\exp(X\beta + \sigma^2/2)\) |
| Correction factor | \(\exp(\sigma^2/2)\) | \(\exp(\sigma^2/2)\) |
In the log-normal case, the log-scale variance is the residual variance and the correction is widely known. In the log-link GLMM case, the log-scale variance is the random effect variance and the same correction applies; however, this connection seems to be rarely made explicit in ecological statistics training. A fixed-effects GLM with a log link has no random component on the linear predictor and therefore no Jensen bias; \(\exp(\beta_0)\) is exactly the arithmetic mean.
This bias grows with random effect variance, is invisible to standard diagnostics (DHARMa, AIC, residual checks), and although it does not affect fixed-effect inference or contrasts, it can substantially affect response-scale predictions used for management.
For models with multiple random effects, the variances are additive on the log scale:
\[\text{correction} = \exp\left(\frac{\sigma^2_1 + \sigma^2_2 + \cdots + \sigma^2_k}{2}\right)\]
This is not a model deficiency: it is a structural
property of GLMMs. Consistent negative RBIAS in
bias_precision() output signals that this heterogeneity is
being ignored in marginal predictions. For purely inferential purposes
(contrasts, p-values, effect directions) this does not matter. For
calibrated response-scale predictions (abundance maps, density
estimates, management outputs), the correction should be applied.
Whether Jensen bias is consequential for a given analysis depends critically on the nature of the random effects in the model and the purpose of the predictions. Users should consider two broad categories:
Substantive random effects represent real, ecologically meaningful group-level variation that is relevant to prediction. For these, Jensen bias matters for response-scale predictions because marginalizing over the RE loses real information about the quantity being predicted:
Nuisance random effects capture study-specific variation that is not relevant to forward predictions or generalization. For these, marginalizing over the RE is conceptually correct and the Jensen bias is largely irrelevant:
The general principle is: if the random effect represents
variation you would want to condition on in a prediction, the Jensen
bias matters. If it represents study-specific noise that should not
carry forward, marginalizing over it is correct and the Jensen bias is a
non-issue. Whether to apply the bias correction is therefore a
scientific judgment call that depends on the nature of the random
effects, the purpose of the predictions, and the magnitude of the bias
given the estimated RE variance. bias_precision() surfaces
the bias and provides tools to quantify and correct it; the decision of
whether to act on it rests with the user.
bias_precision() provides a bias_adjust
argument with three options for diagnosing and correcting Jensen’s
inequality bias in glmmTMB models. A message is
automatically triggered when both in-sample and out-of-sample RBIAS are
consistently negative (< -10%), prompting the user to investigate.
The recommended diagnostic workflow is:
Step 1 - Run with defaults
(bias_adjust = "none") and inspect RBIAS. A consistently
negative RBIAS in both in-sample and out-of-sample performance suggests
Jensen’s inequality may be contributing.
# Step 1: default marginal predictions, Jensen signal visible
chk_none <- bias_precision(nReps = 100, testModel = countModel_GLMM,
testData = countData, method = "holdout",
DHARMaPlot = FALSE, bias_adjust = "none", seed = 123)
chk_none$bias_precision_summaryStep 2 - Re-run with
bias_adjust = "manual". If RBIAS moves toward zero,
Jensen’s inequality is confirmed as the source of the negative bias
rather than model misspecification or poor generalization. The
analytical correction \(\exp(\sigma^2/2)\) is applied to marginal
predictions using the total random effect variance from
VarCorr().
# Step 2: analytical lognormal correction
chk_manual <- bias_precision(nReps = 100, testModel = countModel_GLMM,
testData = countData, method = "holdout",
DHARMaPlot = FALSE, bias_adjust = "manual", seed = 123)
chk_manual$bias_precision_summaryStep 3 (optional) - Re-run with
bias_adjust = "tmb" to validate the manual correction using
TMB’s automatic differentiation. If manual and TMB agree closely, the
analytical correction is reliable for your RE structure. Note that this
step is considerably slower than Step 2.
# Step 3 (optional): TMB validation
chk_tmb <- bias_precision(nReps = 100, testModel = countModel_GLMM,
testData = countData, method = "holdout",
DHARMaPlot = FALSE, bias_adjust = "tmb", seed = 123)
chk_tmb$bias_precision_summaryOnce Jensen’s inequality has been diagnosed as the source of negative
RBIAS, the correction can be applied directly to response-scale
predictions outside of GLAMMGoF. Note that although the calculations are
explicitly written out below, GLAMMGoF has a
jensen_correct() which provides a convenient wrapper for
this calculation, accepting predictions, standard errors, and confidence
interval bounds on the link or response scale (see ?jensen_correct for
details):
# Manual lognormal correction (fast, reliable for random intercept models)
vc <- VarCorr(countModel_GLMM)$cond
correction <- exp(sum(sapply(vc, function(x) x[1, 1])) / 2)
preds_adjusted <- predict(countModel_GLMM, newdata = countData,
type = "response", re.form = ~0) * correction
# Standard errors for manually corrected predictions should also be scaled
# by the same correction factor, since it is a multiplicative constant.
# The delta method gives: SE_corrected = SE_original * correction
preds_se <- predict(countModel_GLMM, newdata = countData,
type = "response", re.form = ~0, se.fit = TRUE)$se.fit
preds_se_adjusted <- preds_se * correction
# TMB correction (handles random slopes and complex RE structures).
# Unlike the manual correction, TMB automatically returns bias-corrected
# standard errors alongside point estimates via automatic differentiation --
# no additional scaling is needed.
preds_tmb <- predict(countModel_GLMM, newdata = countData,
type = "response", re.form = ~0, do.bias.correct = TRUE)
# preds_tmb is a matrix with columns:
# "Estimate" -- uncorrected marginal prediction
# "Std. Error" -- SE of uncorrected prediction
# "Est. (bias.correct)" -- bias-corrected prediction
# "Std. (bias.correct)" -- SE of bias-corrected prediction (use this one)
preds_corrected <- preds_tmb[, "Est. (bias.correct)"]
preds_se_corrected <- preds_tmb[, "Std. (bias.correct)"]Note that bias_adjust = "manual" is not supported for
binomial models in brier_auc() since the analytical
log-normal correction is only valid for log-link models. Use
bias_adjust = "tmb" instead for binomial GLMMs.
conditional_predictions argumentBy default, bias_precision() and
brier_auc() use population-level (marginal) predictions for
both in-sample and out-of-sample performance, setting random effects to
zero (re.form = ~0). This assesses how well the model
generalizes to new observations where group-level context is unavailable
- the most relevant question when predicting to new sites, new years, or
new individuals.
Setting conditional_predictions = TRUE switches both
in-sample and out-of-sample predictions to condition on estimated random
effects (re.form = NULL). Since GLAMMGoF’s holdout CV
splits rows within groups rather than groups themselves, random effect
estimates from the training model are valid for held-out observations
from the same groups. In-sample and out-of-sample metrics remain
directly comparable since both use the same prediction strategy; the gap
between them reflects genuine row-level over-fitting rather than any
marginal vs conditional distinction.
bias_adjust and
conditional_predictions combinationsThe table below summarizes the behavior of each combination of
bias_adjust and conditional_predictions in
bias_precision(). Note that
bias_adjust = "tmb" uses re.form = NULL as a
computational necessity so that TMB’s automatic differentiation can
access the RE variance structure; this is not a choice to make
group-specific conditional predictions, and the output is still a
corrected population-level marginal mean. This means
conditional_predictions has no additional effect when
bias_adjust = "tmb"; both settings produce the same
TMB-corrected behavior. There are five distinct behaviors in total; the
combination marked ❌ is blocked with an informative error.
bias_adjust |
conditional_predictions |
Prediction type | What it measures |
|---|---|---|---|
"none" (default) |
FALSE (default) |
Marginal | Population-level (i.e., ignore REs) generalization; Jensen signal visible in RBIAS |
"manual" |
FALSE |
Marginal + corrected | Population-level (i.e., ignore REs) generalization with analytical Jensen correction |
"tmb" |
FALSE or TRUE |
Corrected population-level | Corrected population-level marginal mean via TMB’s AD;
uses re.form = NULL internally as a computational
requirement, not to obtain group-specific predictions |
"none" |
TRUE |
Conditional | Within-group (i.e., predict to known RE groupings) accuracy; Jensen bias absorbed by REs |
"manual" |
TRUE |
❌ Error | Double-correction - not permitted |
For brier_auc(), the same logic applies with the
exception that bias_adjust = "manual" is not available for
binomial models regardless of conditional_predictions.
The default combination (bias_adjust = "none",
conditional_predictions = FALSE) is recommended as the
starting point for all analyses. It provides the most diagnostic
information and surfaces Jensen’s inequality bias when present. The
other combinations are best used in sequence as part of the step-wise
diagnostic workflow described above. Note that there is currently no
option to obtain truly marginal, TMB-corrected predictions, since the
TMB correction is mathematically tied to conditional prediction.
Based on the foregoing, we recommend the following general workflow
for log-link GLMMs with random effects. Use marginal predictions
(re.form = ~0) for all inferential purposes; fixed-effect
contrasts, effect directions, p-values, and model comparison are
unaffected by Jensen’s inequality bias. For response-scale quantities
intended for reporting, figures, or management; predicted abundances,
density estimates, and covariate effect plots; apply the lognormal bias
correction via jensen_correct() or
bias_adjust = "manual" in bias_precision().
This two-stage approach separates the inferential and predictive roles
of the model and ensures that response-scale outputs are calibrated to
the arithmetic mean of the population rather than the geometric
mean.
bias_precision()For obvious reasons, the code below will cause
bias_precision quit and throw an error message:
bias_precision(nReps = 100, testModel = logitModel_GLMM, testData = logitData, propTrain = 0.8, DHARMaPlot = TRUE, testZI = TRUE, DHARMaReps = 1000, seed = 123, method = "holdout")Error in bias_precision(nReps = 100, testModel = logitModel_GLMM, testData = logitData, :
Response variable is binary! Use brier_auc() instead
brier_auc()For equally obvious reasons, the code below will cause
brier_auc quit and throw an error message:
brier_auc(nReps = 100, testModel = countModel_GLMM, testData = countData, propTrain = 0.8, DHARMaPlot = TRUE, DHARMaReps = 1000, seed = 123, method = "holdout")Error in brier_auc(nReps = 100, testModel = countModel_GLMM, testData = countData, :
Response variable is not binary! Use bias_precision() instead