GLAMMGoF Tutorial

What does GLAMMGoF do?

GLAMMGoF is an R package for resampling-based predictive assessment of generalized linear and generalized additive models. The two main functions in the package, bias_precision() and brier_auc(), quantify in-sample and out-of-sample predictive performance by repeatedly refitting a model structure to re-sampled training data and evaluating predictions on held-out testing data. These functions support model objects from a variety of R packages, including glmmTMB() from glmmTMB, lmer(), glmer(), and glmer.nb() from lme4, glm.nb() from MASS, gam() from mgcv, and glm() and lm() from the stats package. Optional residual diagnostics can also be performed using the DHARMa package.

bias_precision() is intended for models with continuous or integer response variables (e.g., Gaussian, Poisson, negative binomial, Tweedie, or Gamma distributions) and returns relative root mean squared error (RRMSE), relative mean absolute error (RMAE), relative median absolute error (RMedAE), and relative bias (RBIAS).
brier_auc() is intended for models with binary response variables (i.e., logistic regression) and returns AUC, Brier score, and log loss.

Both functions accept models with or without random effects. By default, predictive performance is based on population-level (marginal) predictions, meaning random effects are excluded during prediction (re.form = ~0). This behavior can be changed via the conditional_predictions argument (see the Jensen’s inequality section below for details).

Two resampling methods are available via the method argument: repeated random holdout (method = “holdout”, the default), which repeatedly partitions the data into training and testing subsets and is intuitive and widely used; and bootstrap resampling with out-of-bag evaluation (method = “bootstrap”), which samples with replacement and tends to make more efficient use of the available data, particularly for smaller data sets.

Important note: The functions in this package take an existing model fitted to the full data set and assess the predictive performance of that model structure through repeated resampling. In each replicate, the model is refit to a re-sampled training data set and evaluated on separate testing data. This workflow is intentional: fitting the final model to all available data maximizes information available for inference, while resampling-based predictive assessment provides a practical way to evaluate generalization without permanently withholding data. This approach is especially useful in ecological applications where sample sizes are often limited.

In general, method = "holdout" more closely reflects prediction to independent unseen observations and may be more intuitive for assessing generalization, whereas method = "bootstrap" makes more efficient use of the available data by evaluating performance on out-of-bag observations and may provide more stable estimates of predictive performance when sample sizes are small.

Getting started

You’ve either downloaded and installed GLAMMGoF from R-universe, GitHub, DevOps, or a local zipped package file. Once installed, you can access the vignette by running browseVignettes("GLAMMGoF"). Note that ggplot2 is a dependency, and functions from numerous other R packages are imported and used as well.

library(GLAMMGoF)

How is GLAMMGoF used?

There are only two functions in this package, and which one you use depends on the type of model being assessed:

bias_precision(): used for models with continuous or integer response variables and estimates four predictive performance metrics using repeated resampling: RRMSE, RMAE, RMedAE, and RBIAS. For all four metrics, values closer to 0 indicate better predictive performance (see ?bias_precision for details). All three accuracy statistics (RRMSE, RMAE, and RMedAE) express prediction error as a percentage of the mean observed value, which places them on a common scale regardless of the units or magnitude of the response. To recover the corresponding raw error in the original units of the response, multiply any of these values by the observed mean and divide by 100. For example, an RRMSE of 18% with a sample mean of 50 implies a root mean squared error of 9 in the original units.

RRMSE (Relative Root Mean Squared Error): The square root of average squared prediction errors, expressed as a percentage of the mean observed value. Because errors are squared before averaging, RRMSE penalizes large individual errors more heavily than small ones, making it sensitive to cases where the model produces occasional extreme predictions.
```
fit_cost_rrmse <- function(y, yhat) sqrt(mean((yhat - y)^2)) / mean(y) * 100
```
RMAE (Relative Mean Absolute Error): The average absolute prediction error expressed as a percentage of the mean observed value. RMAE treats all errors proportionally to their size without the extra weight that squaring applies, and is often the most intuitive summary of typical prediction accuracy.
```
fit_cost_rmae  <- function(y, yhat) mean(abs(yhat - y)) / mean(y) * 100
```
RMedAE (Relative Median Absolute Error): The median absolute prediction error expressed as a percentage of the mean observed value. Because it is based on the median rather than the mean of the error distribution, RMedAE is the most robust of the three to outlying predictions and gives the best picture of accuracy for a typical observation when the error distribution is skewed.
```
fit_cost_rmedae  <- function(y, yhat) median(abs(yhat - y)) / mean(y) * 100
```
- Taken together, these three statistics tell a more complete story than any one alone. If RRMSE is notably larger than RMAE, that signals a handful of high-leverage predictions are inflating the squared-error average, which is a pattern RMedAE will often fail to reflect if the bulk of predictions are accurate. Conversely, close agreement among all three suggests errors are roughly symmetric and there are no extreme outliers driving the summary.
RBIAS (Relative Bias): The mean signed prediction error expressed as a percentage of the mean observed value, where positive values indicate systematic over-prediction and negative values indicate systematic under-prediction. RBIAS is independent of the accuracy metrics above: a model can be nearly unbiased on average yet still produce large errors, or it can be highly biased while still ranking observations correctly. Reporting RBIAS alongside RRMSE and RMAE therefore distinguishes random prediction noise from systematic directional error.
```
fit_cost_rbias <- function(y, yhat) mean(yhat - y) / mean(y) * 100
```

brier_auc(): used for models with binary response variables and estimates three predictive performance metrics using repeated resampling: AUC, Brier score, and log loss. AUC values closer to 1 (bounded between 0 and 1) indicate better classification performance, whereas Brier score (bounded between 0 and 1) and log loss (lower bound of 0, no upper bound) values closer to 0 indicate better probabilistic predictions (see ?brier_auc for details).

The three binary metrics each capture a different aspect of predictive performance and are most informative when interpreted together.
AUC (Area Under the ROC Curve): The probability that a randomly chosen positive case receives a higher predicted probability than a randomly chosen negative case, ranging from 0 (no discrimination) to 1 (perfect discrimination), and where an AUC of 0.5 indicates that a model predicts no better than random chance (and AUC <0.5 is worse than random). AUC is purely a measure of discrimination (the model’s ability to rank cases by their probability of the outcome) and is insensitive to how well the predicted probabilities are calibrated in absolute terms. This function also returns the null model (i.e., intercept-only) AUC statistic, which is assumed to be 0.5, which allows for comparison to the summarized AUC metrics.
Brier Score: Across all observations, the mean squared difference between predicted probabilities and observed binary outcomes (0/1), ranging from 0 (perfect) to 1 (worst). The Brier score jointly rewards good discrimination and good calibration, penalizing confident but wrong predictions more heavily than uncertain ones. A model can have high AUC but a poor Brier score if its predicted probabilities are systematically too high or too low even while correctly ordering cases. This function also returns the null model (i.e., intercept-only) Brier score for reference (i.e., the best predictive performance achievable without any predictors), computed as p_bar * (1 - p_bar), where p_bar is the observed event rate, which allows for comparison to the summarized Brier score metrics. Note that the null Brier score is based on the arithmetic mean of the response, which approximates but is not identical to plogis(intercept) from a fitted intercept-only logistic model; the difference is negligible in practice and avoids the overhead of model fitting.
Log Loss (Binary Cross-Entropy): Across all observations, the mean negative log-likelihood of the observed outcomes under the predicted probabilities, where lower values indicate better performance. Like the Brier score, log loss rewards calibration, but its logarithmic scale means it penalizes extreme, confidently wrong predictions far more aggressively. Log loss is therefore the most sensitive of the three to cases where the model assigns near-zero probability to an outcome that actually occurs. This function also returns the null model (i.e., intercept-only) log loss for reference (i.e., the best predictive performance achievable without any predictors), computed as -(p_bar * log(p_bar) + (1 - p_bar) * log(1 - p_bar)), where p_bar is the observed event rate, which allows for comparison to the summarized log loss metrics. Note that the null log loss is based on the arithmetic mean of the response, which approximates but is not identical to plogis(intercept) from a fitted intercept-only logistic model; the difference is negligible in practice and avoids the overhead of model fitting. Log loss is computed internally as:

    log_loss <- function(y, p, eps = 1e-15) {p <- pmax(pmin(p, 1 - eps), eps) -mean(y * log(p) + (1 - y) * log(1 - p))}

Reporting all three together is recommended: AUC reflects whether the model correctly separates cases from non-cases; Brier score provides an overall summary of probability accuracy on a squared-error scale; and log loss highlights whether the model is making any dangerously overconfident errors that the Brier score might underweight.

Both functions assess in- and out-of-sample predictive performance for generalized linear and generalized additive models, with or without random effects, by repeatedly resampling data and evaluating predictive performance across multiple replicates. Results are stored across all replicates and reported as a summary table (mean and 95% confidence intervals) and a histogram of replicate values for each performance statistic, with a blue dotted vertical line indicating the mean across replicates. For brier_auc(), a red dashed vertical line is also displayed in the log loss panels, indicating the null (intercept-only) log loss baseline as a reference point for model comparison.

Two resampling methods are available via the method argument. The default (method = "holdout") uses repeated random holdout (Monte Carlo cross-validation), which splits the data into a training set (80% by default, controlled by propTrain) and a test set (the remaining 20%). The model is refit to the training data and performance is evaluated on both the training data (in-sample performance) and the withheld test data (out-of-sample performance). This method is intuitive and widely understood, but the fixed training/test split means the training set is always smaller than the full data set, which can be a limitation for smaller data sets.

Alternatively, method = "bootstrap" uses bootstrap resampling with replacement, where each bootstrap training data set contains the same number of observations as the original data set, sampled with replacement. Out-of-sample performance is evaluated on the out-of-bag observations not selected in the bootstrap sample (approximately 36.8% of observations on average). This method makes more efficient use of the available data and tends to produce more stable estimates, particularly for smaller data sets, but is slightly less intuitive than the holdout approach.

For well-behaved models and reasonably sized data sets, both methods should produce similar results. Differences are most likely to emerge with small data sets, highly overdispersed data, or poorly specified models.

For zero-heavy count or continuous data, the choice of performance metric matters. RBIAS in particular can behave unexpectedly when the response contains zeros or values near zero, because the relative error formulation divides by the observed value, producing unstable or extreme terms that can dominate the average and give a misleading impression of systematic underprediction. This is a known limitation of relative bias metrics rather than evidence of poor model fit (Hyndman and Koehler 2006), and is most likely to occur when zeros are common but the zero-generating process is not explicitly modeled.

In these situations, RMedAE or RMAE may provide more reliable summaries of predictive performance. RMedAE is robust to extreme errors because it is based on the median absolute error rather than the mean, and therefore better reflects predictive accuracy for a typical observation. RMAE can also be useful because absolute-error metrics weight errors linearly rather than quadratically, making them less sensitive to extreme observations than RMSE-based measures.
When combined with the DHARMa zero-inflation test (testZI = TRUE), these metrics provide a more complete assessment than any single measure alone. RMedAE or RMAE summarize predictive accuracy, while the ZI test evaluates whether the model correctly reproduces the observed frequency of zeros. A significant ZI test result with ratioObsSim > 1 indicates that the model is underpredicting zeros, suggesting that a zero-inflated model family or additional sources of heterogeneity should be considered.

This entire process is repeated nReps times, and the resulting sampling distributions of the performance metrics are summarized. Because RRMSE, RMAE, RMedAE, RBIAS, AUC, Brier score, and log loss quantify different aspects of predictive performance, they should be interpreted as complementary rather than interchangeable

There are two example data sets, countData and logitData, with integer and binary response variables, respectively. Each data set also includes a variety of continuous and categorical predictor variables. Additionally, there are 12 example model objects (6 beginning with count and 6 beginning with logit): Model_GLM, Model_GLMM, Model_GLMM2, Model_GAM, Model_GAMM, and Model_GAMM2. Models ending in M include only fixed effects, whereas models ending in MM and MM2 include one and two random effects, respectively. All 12 example models were fitted using glmmTMB, but GLAMMGoF supports models fitted using a variety of compatible R packages

Using the `bias_precision()` function

To use this function properly, you must:

Be sure you’re using it with the correct model type, which in this case is one with a continuous or integer response variable. Here, a variety of distributions (e.g., Gaussian, Poisson, negative binomial, Tweedie, and gamma) are supported, as are numerous R packages (glmmTMB, lme4 (including glmer, glmer.nb, and lmer), MASS (glm.nb), mgcv (gam), and glm/lm from the stats package).
The only other choices you need to make are:
1. The number of bootstrap or Monte Carlo replicates you want to use (the default is nReps=100 but this should be at least 500 or 1000).
2. For method = "holdout" only: The proportion of data you want to use for assessing in-sample predictive performance (the default is 0.8).
3. Whether you want to use the DHARMa package’s simulateResiduals() function to assess and plot the model’s residuals (the default is DHARMaPlot=TRUE), whether you also want DHARMa's to test for zero inflation using the testZeroInflation() function (the default is testZI=TRUE), and how many simulation replicates you want DHARMa to use (the default is DHARMaReps=1000).

Now let’s run it and see what happens…

bp <- bias_precision(nReps = 100, testModel = countModel_GLMM, testData = countData, propTrain = 0.8, DHARMaPlot = TRUE, testZI = TRUE, DHARMaReps = 1000, seed = 123, method = "holdout")

Look at the first few fit statistics…

simRep	Group	Metric	value
1	In-sample performance	RRMSE	105.96
1	Out-of-sample performance	RRMSE	98.81
1	In-sample performance	RMedAE	47.52
1	Out-of-sample performance	RMedAE	48.30
1	In-sample performance	RMAE	69.30
1	Out-of-sample performance	RMAE	68.08
1	In-sample performance	RBIAS	-4.76
1	Out-of-sample performance	RBIAS	-0.79

And then look at a summary of the nReps values…

Group	Metric	mn	lwr95	upr95
In-sample performance	RRMSE	104.83	101.13	107.46
In-sample performance	RMAE	68.92	67.68	70.10
In-sample performance	RMedAE	47.42	45.76	48.98
In-sample performance	RBIAS	-4.77	-6.16	-3.22
Out-of-sample performance	RRMSE	104.25	93.90	118.32
Out-of-sample performance	RMAE	69.17	64.06	76.01
Out-of-sample performance	RMedAE	47.44	41.29	54.63
Out-of-sample performance	RBIAS	-4.27	-16.88	11.77

And now look at a plot of the summaries

Now let’s compare to the bootstrap method, which generally should be very similar for well-behaved models and reasonably sized data sets.

bp <- bias_precision(nReps = 100, testModel = countModel_GLMM, testData = countData, propTrain = 0.8, DHARMaPlot = TRUE, testZI = TRUE, DHARMaReps = 1000, seed = 123, method = "bootstrap")

Summary of bootstrap method

Group	Metric	mn	lwr95	upr95
In-sample performance	RRMSE	104.68	96.95	112.63
In-sample performance	RMAE	68.68	65.40	72.17
In-sample performance	RMedAE	47.24	44.61	50.68
In-sample performance	RBIAS	-5.13	-7.97	-2.41
Out-of-sample performance	RRMSE	104.66	93.40	113.78
Out-of-sample performance	RMAE	69.33	65.18	73.82
Out-of-sample performance	RMedAE	47.64	41.25	53.65
Out-of-sample performance	RBIAS	-4.52	-18.39	6.43

And now look at a plot of the summaries

Using the `brier_auc()` function

To use this function properly, you must:

Be sure you’re using it with the correct model type, which in this case is only a logistic regression with a binary response variable. Additionally, logistic regression models fit using the R packages glmmTMB, lme4, and glm/lm from the stats package are all supported.
The only other choices you need to make are:
1. The number of bootstrap or Monte Carlo replicates you want to use (the default nReps = 100 but this should be at least 500 or 1000).
2. For method = "holdout" only: The proportion of data you want to use for assessing in-sample predictive performance (the default is propTrain = 0.8).
3. Whether you want to use the DHARMa package’s simulateResiduals() function to assess and plot the model’s residuals (the default is DHARMaPlot=TRUE) and how many simulation replicates you want DHARMa to use (the default is DHARMaReps=1000). There is no option to test for zero-inflation with this function as this is not strictly relevant to logistic regression models with binary responses.

br_auc <- brier_auc(nReps = 100, testModel = logitModel_GLMM, testData = logitData, propTrain = 0.8, DHARMaPlot = TRUE, DHARMaReps = 1000, seed = 123, method = "holdout")

Look at the first few fit statistics…

simRep	Group	Metric	value
1	In-sample performance	AUC statistic	0.76
1	In-sample performance	Brier score	0.15
1	In-sample performance	Log loss	0.46
1	Out-of-sample performance	AUC statistic	0.70
1	Out-of-sample performance	Brier score	0.14
1	Out-of-sample performance	Log loss	0.43
2	In-sample performance	AUC statistic	0.77
2	In-sample performance	Brier score	0.14
2	In-sample performance	Log loss	0.45

And then look at a summary of the nReps values…

Group	Metric	mn	lwr95	upr95
In-sample performance	AUC statistic	0.75	0.74	0.77
In-sample performance	Brier score	0.14	0.14	0.15
In-sample performance	Log loss	0.45	0.43	0.46
Out-of-sample performance	AUC statistic	0.74	0.66	0.80
Out-of-sample performance	Brier score	0.15	0.12	0.18
Out-of-sample performance	Log loss	0.46	0.40	0.56
Null model (baseline)	AUC statistic	0.50	NA	NA
Null model (baseline)	Brier score	0.17	NA	NA
Null model (baseline)	Log loss	0.52	NA	NA

And now look at a plot of the summaries

Let’s again compare to the bootstrap method, which generally should be very similar for well-behaved models and reasonably sized data sets.

br_auc <- brier_auc(nReps = 100, testModel = logitModel_GLMM, testData = logitData, propTrain = 0.8, DHARMaPlot = TRUE, DHARMaReps = 1000, seed = 123, method = "bootstrap")

Summary of bootstrap method

Group	Metric	mn	lwr95	upr95
In-sample performance	AUC statistic	0.75	0.72	0.79
In-sample performance	Brier score	0.14	0.13	0.16
In-sample performance	Log loss	0.45	0.42	0.49
Out-of-sample performance	AUC statistic	0.74	0.68	0.79
Out-of-sample performance	Brier score	0.15	0.13	0.17
Out-of-sample performance	Log loss	0.46	0.41	0.52
Null model (baseline)	AUC statistic	0.50	NA	NA
Null model (baseline)	Brier score	0.17	NA	NA
Null model (baseline)	Log loss	0.52	NA	NA

And now look at a plot of the summaries

Jensen’s inequality, random effects, and marginal predictions

When a GLMM includes random effects and a nonlinear link function (e.g., log or logit), back-transforming population-level (marginal) predictions to the response scale introduces a systematic bias. This occurs because the exponential function curves upward – averaging on the log scale and then back-transforming always gives a smaller value than back-transforming individual predictions and then averaging (e.g., a bowl-shaped or exponential growth curve). This property is known as Jensen’s inequality. For a log-link model with random effects, this means:

\[E[\exp(\beta_0 + u)] = \exp(\beta_0) \cdot \exp(\sigma^2/2)\]

where \(\sigma^2\) is the total random effect variance. In other words, \(\exp(\beta_0)\) , the typical marginal prediction, underestimates the true arithmetic mean by a factor of \(\exp(\sigma^2/2)\).

This bias is directly analogous to the well-known log-normal back-transformation bias. The math is identical; the only difference is where the log-scale variance comes from:

	Log-normal model	Log-link GLMM
Model structure	\(\log(Y) \sim \text{Normal}(\mu, \sigma^2)\)	\(\log(\mu_{ij}) = X\beta + u_j\), \(u_j \sim \text{Normal}(0, \sigma^2)\)
Log-scale variance	Residual \(\sigma^2\)	Random effect variance \(\sigma^2\)
Naive back-transform	\(\exp(\mu)\)	\(\exp(X\beta)\)
True arithmetic mean	\(\exp(\mu + \sigma^2/2)\)	\(\exp(X\beta + \sigma^2/2)\)
Correction factor	\(\exp(\sigma^2/2)\)	\(\exp(\sigma^2/2)\)

In the log-normal case, the log-scale variance is the residual variance and the correction is widely known. In the log-link GLMM case, the log-scale variance is the random effect variance and the same correction applies; however, this connection seems to be rarely made explicit in ecological statistics training. A fixed-effects GLM with a log link has no random component on the linear predictor and therefore no Jensen bias; \(\exp(\beta_0)\) is exactly the arithmetic mean.

This bias grows with random effect variance, is invisible to standard diagnostics (DHARMa, AIC, residual checks), and although it does not affect fixed-effect inference or contrasts, it can substantially affect response-scale predictions used for management.

For models with multiple random effects, the variances are additive on the log scale:

\[\text{correction} = \exp\left(\frac{\sigma^2_1 + \sigma^2_2 + \cdots + \sigma^2_k}{2}\right)\]

This is not a model deficiency: it is a structural property of GLMMs. Consistent negative RBIAS in bias_precision() output signals that this heterogeneity is being ignored in marginal predictions. For purely inferential purposes (contrasts, p-values, effect directions) this does not matter. For calibrated response-scale predictions (abundance maps, density estimates, management outputs), the correction should be applied.

When does Jensen bias matter? Substantive vs nuisance random effects

Whether Jensen bias is consequential for a given analysis depends critically on the nature of the random effects in the model and the purpose of the predictions. Users should consider two broad categories:

Substantive random effects represent real, ecologically meaningful group-level variation that is relevant to prediction. For these, Jensen bias matters for response-scale predictions because marginalizing over the RE loses real information about the quantity being predicted:

Site random effects when predicting abundance or density at specific locations
Spatial random effects when generating abundance maps or occupancy surfaces
Individual random effects in mark-recapture models when individual-level predictions are of interest
Population random effects in hierarchical models where population identity carries biological meaning

Nuisance random effects capture study-specific variation that is not relevant to forward predictions or generalization. For these, marginalizing over the RE is conceptually correct and the Jensen bias is largely irrelevant:

Year random effects capturing inte-rannual variation during a study period: not meaningful when predicting to future years or simulating counts under specified covariate conditions
Observer random effects capturing individual variation in detection: not relevant when predicting to new observers
Overdispersion random effects (OLREs): capture residual heterogeneity not attributable to any substantive grouping factor
Batch or run random effects in laboratory studies: study-specific, not generalizable

The general principle is: if the random effect represents variation you would want to condition on in a prediction, the Jensen bias matters. If it represents study-specific noise that should not carry forward, marginalizing over it is correct and the Jensen bias is a non-issue. Whether to apply the bias correction is therefore a scientific judgment call that depends on the nature of the random effects, the purpose of the predictions, and the magnitude of the bias given the estimated RE variance. bias_precision() surfaces the bias and provides tools to quantify and correct it; the decision of whether to act on it rests with the user.

Diagnosing and correcting Jensen’s inequality bias

bias_precision() provides a bias_adjust argument with three options for diagnosing and correcting Jensen’s inequality bias in glmmTMB models. A message is automatically triggered when both in-sample and out-of-sample RBIAS are consistently negative (< -10%), prompting the user to investigate. The recommended diagnostic workflow is:

Step 1 - Run with defaults (bias_adjust = "none") and inspect RBIAS. A consistently negative RBIAS in both in-sample and out-of-sample performance suggests Jensen’s inequality may be contributing.

# Step 1: default marginal predictions, Jensen signal visible
chk_none <- bias_precision(nReps = 100, testModel = countModel_GLMM,
                           testData = countData, method = "holdout",
                           DHARMaPlot = FALSE, bias_adjust = "none", seed = 123)
chk_none$bias_precision_summary

Step 2 - Re-run with bias_adjust = "manual". If RBIAS moves toward zero, Jensen’s inequality is confirmed as the source of the negative bias rather than model misspecification or poor generalization. The analytical correction \(\exp(\sigma^2/2)\) is applied to marginal predictions using the total random effect variance from VarCorr().

# Step 2: analytical lognormal correction
chk_manual <- bias_precision(nReps = 100, testModel = countModel_GLMM,
                             testData = countData, method = "holdout",
                             DHARMaPlot = FALSE, bias_adjust = "manual", seed = 123)
chk_manual$bias_precision_summary

Step 3 (optional) - Re-run with bias_adjust = "tmb" to validate the manual correction using TMB’s automatic differentiation. If manual and TMB agree closely, the analytical correction is reliable for your RE structure. Note that this step is considerably slower than Step 2.

# Step 3 (optional): TMB validation
chk_tmb <- bias_precision(nReps = 100, testModel = countModel_GLMM,
                          testData = countData, method = "holdout",
                          DHARMaPlot = FALSE, bias_adjust = "tmb", seed = 123)
chk_tmb$bias_precision_summary

Once Jensen’s inequality has been diagnosed as the source of negative RBIAS, the correction can be applied directly to response-scale predictions outside of GLAMMGoF. Note that although the calculations are explicitly written out below, GLAMMGoF has a jensen_correct() which provides a convenient wrapper for this calculation, accepting predictions, standard errors, and confidence interval bounds on the link or response scale (see ?jensen_correct for details):

# Manual lognormal correction (fast, reliable for random intercept models)
vc         <- VarCorr(countModel_GLMM)$cond
correction <- exp(sum(sapply(vc, function(x) x[1, 1])) / 2)
preds_adjusted <- predict(countModel_GLMM, newdata = countData,
                          type = "response", re.form = ~0) * correction

# Standard errors for manually corrected predictions should also be scaled
# by the same correction factor, since it is a multiplicative constant.
# The delta method gives: SE_corrected = SE_original * correction
preds_se   <- predict(countModel_GLMM, newdata = countData,
                      type = "response", re.form = ~0, se.fit = TRUE)$se.fit
preds_se_adjusted <- preds_se * correction

# TMB correction (handles random slopes and complex RE structures).
# Unlike the manual correction, TMB automatically returns bias-corrected
# standard errors alongside point estimates via automatic differentiation --
# no additional scaling is needed.
preds_tmb  <- predict(countModel_GLMM, newdata = countData,
                      type = "response", re.form = ~0, do.bias.correct = TRUE)
# preds_tmb is a matrix with columns:
#   "Estimate"            -- uncorrected marginal prediction
#   "Std. Error"          -- SE of uncorrected prediction
#   "Est. (bias.correct)" -- bias-corrected prediction
#   "Std. (bias.correct)" -- SE of bias-corrected prediction (use this one)
preds_corrected    <- preds_tmb[, "Est. (bias.correct)"]
preds_se_corrected <- preds_tmb[, "Std. (bias.correct)"]

Note that bias_adjust = "manual" is not supported for binomial models in brier_auc() since the analytical log-normal correction is only valid for log-link models. Use bias_adjust = "tmb" instead for binomial GLMMs.

Marginal vs conditional predictions: the `conditional_predictions` argument

By default, bias_precision() and brier_auc() use population-level (marginal) predictions for both in-sample and out-of-sample performance, setting random effects to zero (re.form = ~0). This assesses how well the model generalizes to new observations where group-level context is unavailable - the most relevant question when predicting to new sites, new years, or new individuals.

Setting conditional_predictions = TRUE switches both in-sample and out-of-sample predictions to condition on estimated random effects (re.form = NULL). Since GLAMMGoF’s holdout CV splits rows within groups rather than groups themselves, random effect estimates from the training model are valid for held-out observations from the same groups. In-sample and out-of-sample metrics remain directly comparable since both use the same prediction strategy; the gap between them reflects genuine row-level over-fitting rather than any marginal vs conditional distinction.

# Conditional predictions for within-group accuracy assessment
chk_cond <- bias_precision(nReps = 100, testModel = countModel_GLMM,
                           testData = countData, method = "holdout",
                           DHARMaPlot = FALSE, conditional_predictions = TRUE,
                           seed = 123)
chk_cond$bias_precision_summary

Summary of `bias_adjust` and `conditional_predictions` combinations

The table below summarizes the behavior of each combination of bias_adjust and conditional_predictions in bias_precision(). Note that bias_adjust = "tmb" uses re.form = NULL as a computational necessity so that TMB’s automatic differentiation can access the RE variance structure; this is not a choice to make group-specific conditional predictions, and the output is still a corrected population-level marginal mean. This means conditional_predictions has no additional effect when bias_adjust = "tmb"; both settings produce the same TMB-corrected behavior. There are five distinct behaviors in total; the combination marked ❌ is blocked with an informative error.

`bias_adjust`	`conditional_predictions`	Prediction type	What it measures
`"none"` (default)	`FALSE` (default)	Marginal	Population-level (i.e., ignore REs) generalization; Jensen signal visible in RBIAS
`"manual"`	`FALSE`	Marginal + corrected	Population-level (i.e., ignore REs) generalization with analytical Jensen correction
`"tmb"`	`FALSE` or `TRUE`	Corrected population-level	Corrected population-level marginal mean via TMB’s AD; uses `re.form = NULL` internally as a computational requirement, not to obtain group-specific predictions
`"none"`	`TRUE`	Conditional	Within-group (i.e., predict to known RE groupings) accuracy; Jensen bias absorbed by REs
`"manual"`	`TRUE`	❌ Error	Double-correction - not permitted

For brier_auc(), the same logic applies with the exception that bias_adjust = "manual" is not available for binomial models regardless of conditional_predictions.

The default combination (bias_adjust = "none", conditional_predictions = FALSE) is recommended as the starting point for all analyses. It provides the most diagnostic information and surfaces Jensen’s inequality bias when present. The other combinations are best used in sequence as part of the step-wise diagnostic workflow described above. Note that there is currently no option to obtain truly marginal, TMB-corrected predictions, since the TMB correction is mathematically tied to conditional prediction.

Recommended workflow for inference and predictions

Based on the foregoing, we recommend the following general workflow for log-link GLMMs with random effects. Use marginal predictions (re.form = ~0) for all inferential purposes; fixed-effect contrasts, effect directions, p-values, and model comparison are unaffected by Jensen’s inequality bias. For response-scale quantities intended for reporting, figures, or management; predicted abundances, density estimates, and covariate effect plots; apply the lognormal bias correction via jensen_correct() or bias_adjust = "manual" in bias_precision(). This two-stage approach separates the inferential and predictive roles of the model and ensures that response-scale outputs are calibrated to the arithmetic mean of the population rather than the geometric mean.

Error messages when using `bias_precision()`

For obvious reasons, the code below will cause bias_precision quit and throw an error message:

bias_precision(nReps = 100, testModel = logitModel_GLMM, testData = logitData, propTrain = 0.8, DHARMaPlot = TRUE, testZI = TRUE, DHARMaReps = 1000, seed = 123, method = "holdout")

Error in bias_precision(nReps = 100, testModel = logitModel_GLMM, testData = logitData,  : 
  Response variable is binary! Use brier_auc() instead

Error messages when using `brier_auc()`

For equally obvious reasons, the code below will cause brier_auc quit and throw an error message:

brier_auc(nReps = 100, testModel = countModel_GLMM, testData = countData, propTrain = 0.8, DHARMaPlot = TRUE, DHARMaReps = 1000, seed = 123, method = "holdout")

Error in brier_auc(nReps = 100, testModel = countModel_GLMM, testData = countData,  : 
  Response variable is not binary! Use bias_precision() instead

GLAMMGoF Tutorial

What does GLAMMGoF do?

Getting started

How is GLAMMGoF used?

Using the bias_precision() function

Using the brier_auc() function

Jensen’s inequality, random effects, and marginal predictions

When does Jensen bias matter? Substantive vs nuisance random effects

Diagnosing and correcting Jensen’s inequality bias

Marginal vs conditional predictions: the conditional_predictions argument

Summary of bias_adjust and conditional_predictions combinations

Recommended workflow for inference and predictions

Error messages when using bias_precision()

Error messages when using brier_auc()

Using the `bias_precision()` function

Using the `brier_auc()` function

Marginal vs conditional predictions: the `conditional_predictions` argument

Summary of `bias_adjust` and `conditional_predictions` combinations

Error messages when using `bias_precision()`

Error messages when using `brier_auc()`