GLAMMGoF Tutorial

What does GLAMMGoF do?

GLAMMGoF is an R package for resampling-based predictive assessment of generalized linear and generalized additive models. The two main functions in the package, bias_precision() and brier_auc(), quantify in-sample and out-of-sample predictive performance by repeatedly refitting a model structure to re-sampled training data and evaluating predictions on held-out testing data. These functions support model objects from a variety of R packages, including glmmTMB() from glmmTMB, lmer(), glmer(), and glmer.nb() from lme4, glm.nb() from MASS, gam() from mgcv, and glm() and lm() from the stats package. Optional residual diagnostics can also be performed using the DHARMa package.

Both functions accept models with or without random effects. By default, predictive performance is based on population-level (marginal) predictions, meaning random effects are excluded during prediction (re.form = ~0). This behavior can be changed via the conditional_predictions argument (see the Jensen’s inequality section below for details).

Two resampling methods are available via the method argument: repeated random holdout (method = “holdout”, the default), which repeatedly partitions the data into training and testing subsets and is intuitive and widely used; and bootstrap resampling with out-of-bag evaluation (method = “bootstrap”), which samples with replacement and tends to make more efficient use of the available data, particularly for smaller data sets.

Important note: The functions in this package take an existing model fitted to the full data set and assess the predictive performance of that model structure through repeated resampling. In each replicate, the model is refit to a re-sampled training data set and evaluated on separate testing data. This workflow is intentional: fitting the final model to all available data maximizes information available for inference, while resampling-based predictive assessment provides a practical way to evaluate generalization without permanently withholding data. This approach is especially useful in ecological applications where sample sizes are often limited.

In general, method = "holdout" more closely reflects prediction to independent unseen observations and may be more intuitive for assessing generalization, whereas method = "bootstrap" makes more efficient use of the available data by evaluating performance on out-of-bag observations and may provide more stable estimates of predictive performance when sample sizes are small.

Getting started

You’ve either downloaded and installed GLAMMGoF from R-universe, GitHub, DevOps, or a local zipped package file. Once installed, you can access the vignette by running browseVignettes("GLAMMGoF"). Note that ggplot2 is a dependency, and functions from numerous other R packages are imported and used as well.

library(GLAMMGoF)

How is GLAMMGoF used?

There are only two functions in this package, and which one you use depends on the type of model being assessed:

bias_precision(): used for models with continuous or integer response variables and estimates four predictive performance metrics using repeated resampling: RRMSE, RMAE, RMedAE, and RBIAS. For all four metrics, values closer to 0 indicate better predictive performance (see ?bias_precision for details). All three accuracy statistics (RRMSE, RMAE, and RMedAE) express prediction error as a percentage of the mean observed value, which places them on a common scale regardless of the units or magnitude of the response. To recover the corresponding raw error in the original units of the response, multiply any of these values by the observed mean and divide by 100. For example, an RRMSE of 18% with a sample mean of 50 implies a root mean squared error of 9 in the original units.

brier_auc(): used for models with binary response variables and estimates three predictive performance metrics using repeated resampling: AUC, Brier score, and log loss. AUC values closer to 1 (bounded between 0 and 1) indicate better classification performance, whereas Brier score (bounded between 0 and 1) and log loss (lower bound of 0, no upper bound) values closer to 0 indicate better probabilistic predictions (see ?brier_auc for details).

    log_loss <- function(y, p, eps = 1e-15) {p <- pmax(pmin(p, 1 - eps), eps) -mean(y * log(p) + (1 - y) * log(1 - p))}

Both functions assess in- and out-of-sample predictive performance for generalized linear and generalized additive models, with or without random effects, by repeatedly resampling data and evaluating predictive performance across multiple replicates. Results are stored across all replicates and reported as a summary table (mean and 95% confidence intervals) and a histogram of replicate values for each performance statistic, with a blue dotted vertical line indicating the mean across replicates. For brier_auc(), a red dashed vertical line is also displayed in the log loss panels, indicating the null (intercept-only) log loss baseline as a reference point for model comparison.

Two resampling methods are available via the method argument. The default (method = "holdout") uses repeated random holdout (Monte Carlo cross-validation), which splits the data into a training set (80% by default, controlled by propTrain) and a test set (the remaining 20%). The model is refit to the training data and performance is evaluated on both the training data (in-sample performance) and the withheld test data (out-of-sample performance). This method is intuitive and widely understood, but the fixed training/test split means the training set is always smaller than the full data set, which can be a limitation for smaller data sets.

Alternatively, method = "bootstrap" uses bootstrap resampling with replacement, where each bootstrap training data set contains the same number of observations as the original data set, sampled with replacement. Out-of-sample performance is evaluated on the out-of-bag observations not selected in the bootstrap sample (approximately 36.8% of observations on average). This method makes more efficient use of the available data and tends to produce more stable estimates, particularly for smaller data sets, but is slightly less intuitive than the holdout approach.

For well-behaved models and reasonably sized data sets, both methods should produce similar results. Differences are most likely to emerge with small data sets, highly overdispersed data, or poorly specified models.

For zero-heavy count or continuous data, the choice of performance metric matters. RBIAS in particular can behave unexpectedly when the response contains zeros or values near zero, because the relative error formulation divides by the observed value, producing unstable or extreme terms that can dominate the average and give a misleading impression of systematic underprediction. This is a known limitation of relative bias metrics rather than evidence of poor model fit (Hyndman and Koehler 2006), and is most likely to occur when zeros are common but the zero-generating process is not explicitly modeled.

This entire process is repeated nReps times, and the resulting sampling distributions of the performance metrics are summarized. Because RRMSE, RMAE, RMedAE, RBIAS, AUC, Brier score, and log loss quantify different aspects of predictive performance, they should be interpreted as complementary rather than interchangeable

There are two example data sets, countData and logitData, with integer and binary response variables, respectively. Each data set also includes a variety of continuous and categorical predictor variables. Additionally, there are 12 example model objects (6 beginning with count and 6 beginning with logit): Model_GLM, Model_GLMM, Model_GLMM2, Model_GAM, Model_GAMM, and Model_GAMM2. Models ending in M include only fixed effects, whereas models ending in MM and MM2 include one and two random effects, respectively. All 12 example models were fitted using glmmTMB, but GLAMMGoF supports models fitted using a variety of compatible R packages

Using the bias_precision() function

To use this function properly, you must:

  1. Be sure you’re using it with the correct model type, which in this case is one with a continuous or integer response variable. Here, a variety of distributions (e.g., Gaussian, Poisson, negative binomial, Tweedie, and gamma) are supported, as are numerous R packages (glmmTMB, lme4 (including glmer, glmer.nb, and lmer), MASS (glm.nb), mgcv (gam), and glm/lm from the stats package).

  2. The only other choices you need to make are:

    1. The number of bootstrap or Monte Carlo replicates you want to use (the default is nReps=100 but this should be at least 500 or 1000).

    2. For method = "holdout" only: The proportion of data you want to use for assessing in-sample predictive performance (the default is 0.8).

    3. Whether you want to use the DHARMa package’s simulateResiduals() function to assess and plot the model’s residuals (the default is DHARMaPlot=TRUE), whether you also want DHARMa's to test for zero inflation using the testZeroInflation() function (the default is testZI=TRUE), and how many simulation replicates you want DHARMa to use (the default is DHARMaReps=1000).

Now let’s run it and see what happens…

bp <- bias_precision(nReps = 100, testModel = countModel_GLMM, testData = countData, propTrain = 0.8, DHARMaPlot = TRUE, testZI = TRUE, DHARMaReps = 1000, seed = 123, method = "holdout")

Look at the first few fit statistics…

simRep Group Metric value
1 In-sample performance RRMSE 105.96
1 Out-of-sample performance RRMSE 98.81
1 In-sample performance RMedAE 47.52
1 Out-of-sample performance RMedAE 48.30
1 In-sample performance RMAE 69.30
1 Out-of-sample performance RMAE 68.08
1 In-sample performance RBIAS -4.76
1 Out-of-sample performance RBIAS -0.79

 

And then look at a summary of the nReps values…

Group Metric mn lwr95 upr95
In-sample performance RRMSE 104.83 101.13 107.46
In-sample performance RMAE 68.92 67.68 70.10
In-sample performance RMedAE 47.42 45.76 48.98
In-sample performance RBIAS -4.77 -6.16 -3.22
Out-of-sample performance RRMSE 104.25 93.90 118.32
Out-of-sample performance RMAE 69.17 64.06 76.01
Out-of-sample performance RMedAE 47.44 41.29 54.63
Out-of-sample performance RBIAS -4.27 -16.88 11.77

 

And now look at a plot of the summaries

 

Now let’s compare to the bootstrap method, which generally should be very similar for well-behaved models and reasonably sized data sets.

bp <- bias_precision(nReps = 100, testModel = countModel_GLMM, testData = countData, propTrain = 0.8, DHARMaPlot = TRUE, testZI = TRUE, DHARMaReps = 1000, seed = 123, method = "bootstrap")

Summary of bootstrap method

Group Metric mn lwr95 upr95
In-sample performance RRMSE 104.68 96.95 112.63
In-sample performance RMAE 68.68 65.40 72.17
In-sample performance RMedAE 47.24 44.61 50.68
In-sample performance RBIAS -5.13 -7.97 -2.41
Out-of-sample performance RRMSE 104.66 93.40 113.78
Out-of-sample performance RMAE 69.33 65.18 73.82
Out-of-sample performance RMedAE 47.64 41.25 53.65
Out-of-sample performance RBIAS -4.52 -18.39 6.43

And now look at a plot of the summaries

Using the brier_auc() function

To use this function properly, you must:

  1. Be sure you’re using it with the correct model type, which in this case is only a logistic regression with a binary response variable. Additionally, logistic regression models fit using the R packages glmmTMB, lme4, and glm/lm from the stats package are all supported.

  2. The only other choices you need to make are:

    1. The number of bootstrap or Monte Carlo replicates you want to use (the default nReps = 100 but this should be at least 500 or 1000).

    2. For method = "holdout" only: The proportion of data you want to use for assessing in-sample predictive performance (the default is propTrain = 0.8).

    3. Whether you want to use the DHARMa package’s simulateResiduals() function to assess and plot the model’s residuals (the default is DHARMaPlot=TRUE) and how many simulation replicates you want DHARMa to use (the default is DHARMaReps=1000). There is no option to test for zero-inflation with this function as this is not strictly relevant to logistic regression models with binary responses.

br_auc <- brier_auc(nReps = 100, testModel = logitModel_GLMM, testData = logitData, propTrain = 0.8, DHARMaPlot = TRUE, DHARMaReps = 1000, seed = 123, method = "holdout")

Look at the first few fit statistics…

simRep Group Metric value
1 In-sample performance AUC statistic 0.76
1 In-sample performance Brier score 0.15
1 In-sample performance Log loss 0.46
1 Out-of-sample performance AUC statistic 0.70
1 Out-of-sample performance Brier score 0.14
1 Out-of-sample performance Log loss 0.43
2 In-sample performance AUC statistic 0.77
2 In-sample performance Brier score 0.14
2 In-sample performance Log loss 0.45

 

 

And then look at a summary of the nReps values…

Group Metric mn lwr95 upr95
In-sample performance AUC statistic 0.75 0.74 0.77
In-sample performance Brier score 0.14 0.14 0.15
In-sample performance Log loss 0.45 0.43 0.46
Out-of-sample performance AUC statistic 0.74 0.66 0.80
Out-of-sample performance Brier score 0.15 0.12 0.18
Out-of-sample performance Log loss 0.46 0.40 0.56
Null model (baseline) AUC statistic 0.50 NA NA
Null model (baseline) Brier score 0.17 NA NA
Null model (baseline) Log loss 0.52 NA NA

 

And now look at a plot of the summaries

 

Let’s again compare to the bootstrap method, which generally should be very similar for well-behaved models and reasonably sized data sets.

br_auc <- brier_auc(nReps = 100, testModel = logitModel_GLMM, testData = logitData, propTrain = 0.8, DHARMaPlot = TRUE, DHARMaReps = 1000, seed = 123, method = "bootstrap")

Summary of bootstrap method

Group Metric mn lwr95 upr95
In-sample performance AUC statistic 0.75 0.72 0.79
In-sample performance Brier score 0.14 0.13 0.16
In-sample performance Log loss 0.45 0.42 0.49
Out-of-sample performance AUC statistic 0.74 0.68 0.79
Out-of-sample performance Brier score 0.15 0.13 0.17
Out-of-sample performance Log loss 0.46 0.41 0.52
Null model (baseline) AUC statistic 0.50 NA NA
Null model (baseline) Brier score 0.17 NA NA
Null model (baseline) Log loss 0.52 NA NA

And now look at a plot of the summaries

Jensen’s inequality, random effects, and marginal predictions

When a GLMM includes random effects and a nonlinear link function (e.g., log or logit), back-transforming population-level (marginal) predictions to the response scale introduces a systematic bias. This occurs because the exponential function curves upward – averaging on the log scale and then back-transforming always gives a smaller value than back-transforming individual predictions and then averaging (e.g., a bowl-shaped or exponential growth curve). This property is known as Jensen’s inequality. For a log-link model with random effects, this means:

\[E[\exp(\beta_0 + u)] = \exp(\beta_0) \cdot \exp(\sigma^2/2)\]

where \(\sigma^2\) is the total random effect variance. In other words, \(\exp(\beta_0)\) , the typical marginal prediction, underestimates the true arithmetic mean by a factor of \(\exp(\sigma^2/2)\).

This bias is directly analogous to the well-known log-normal back-transformation bias. The math is identical; the only difference is where the log-scale variance comes from:

Log-normal model Log-link GLMM
Model structure \(\log(Y) \sim \text{Normal}(\mu, \sigma^2)\) \(\log(\mu_{ij}) = X\beta + u_j\), \(u_j \sim \text{Normal}(0, \sigma^2)\)
Log-scale variance Residual \(\sigma^2\) Random effect variance \(\sigma^2\)
Naive back-transform \(\exp(\mu)\) \(\exp(X\beta)\)
True arithmetic mean \(\exp(\mu + \sigma^2/2)\) \(\exp(X\beta + \sigma^2/2)\)
Correction factor \(\exp(\sigma^2/2)\) \(\exp(\sigma^2/2)\)

In the log-normal case, the log-scale variance is the residual variance and the correction is widely known. In the log-link GLMM case, the log-scale variance is the random effect variance and the same correction applies; however, this connection seems to be rarely made explicit in ecological statistics training. A fixed-effects GLM with a log link has no random component on the linear predictor and therefore no Jensen bias; \(\exp(\beta_0)\) is exactly the arithmetic mean.

This bias grows with random effect variance, is invisible to standard diagnostics (DHARMa, AIC, residual checks), and although it does not affect fixed-effect inference or contrasts, it can substantially affect response-scale predictions used for management.

For models with multiple random effects, the variances are additive on the log scale:

\[\text{correction} = \exp\left(\frac{\sigma^2_1 + \sigma^2_2 + \cdots + \sigma^2_k}{2}\right)\]

This is not a model deficiency: it is a structural property of GLMMs. Consistent negative RBIAS in bias_precision() output signals that this heterogeneity is being ignored in marginal predictions. For purely inferential purposes (contrasts, p-values, effect directions) this does not matter. For calibrated response-scale predictions (abundance maps, density estimates, management outputs), the correction should be applied.

When does Jensen bias matter? Substantive vs nuisance random effects

Whether Jensen bias is consequential for a given analysis depends critically on the nature of the random effects in the model and the purpose of the predictions. Users should consider two broad categories:

Substantive random effects represent real, ecologically meaningful group-level variation that is relevant to prediction. For these, Jensen bias matters for response-scale predictions because marginalizing over the RE loses real information about the quantity being predicted:

Nuisance random effects capture study-specific variation that is not relevant to forward predictions or generalization. For these, marginalizing over the RE is conceptually correct and the Jensen bias is largely irrelevant:

The general principle is: if the random effect represents variation you would want to condition on in a prediction, the Jensen bias matters. If it represents study-specific noise that should not carry forward, marginalizing over it is correct and the Jensen bias is a non-issue. Whether to apply the bias correction is therefore a scientific judgment call that depends on the nature of the random effects, the purpose of the predictions, and the magnitude of the bias given the estimated RE variance. bias_precision() surfaces the bias and provides tools to quantify and correct it; the decision of whether to act on it rests with the user.

Diagnosing and correcting Jensen’s inequality bias

bias_precision() provides a bias_adjust argument with three options for diagnosing and correcting Jensen’s inequality bias in glmmTMB models. A message is automatically triggered when both in-sample and out-of-sample RBIAS are consistently negative (< -10%), prompting the user to investigate. The recommended diagnostic workflow is:

Step 1 - Run with defaults (bias_adjust = "none") and inspect RBIAS. A consistently negative RBIAS in both in-sample and out-of-sample performance suggests Jensen’s inequality may be contributing.

# Step 1: default marginal predictions, Jensen signal visible
chk_none <- bias_precision(nReps = 100, testModel = countModel_GLMM,
                           testData = countData, method = "holdout",
                           DHARMaPlot = FALSE, bias_adjust = "none", seed = 123)
chk_none$bias_precision_summary

Step 2 - Re-run with bias_adjust = "manual". If RBIAS moves toward zero, Jensen’s inequality is confirmed as the source of the negative bias rather than model misspecification or poor generalization. The analytical correction \(\exp(\sigma^2/2)\) is applied to marginal predictions using the total random effect variance from VarCorr().

# Step 2: analytical lognormal correction
chk_manual <- bias_precision(nReps = 100, testModel = countModel_GLMM,
                             testData = countData, method = "holdout",
                             DHARMaPlot = FALSE, bias_adjust = "manual", seed = 123)
chk_manual$bias_precision_summary

Step 3 (optional) - Re-run with bias_adjust = "tmb" to validate the manual correction using TMB’s automatic differentiation. If manual and TMB agree closely, the analytical correction is reliable for your RE structure. Note that this step is considerably slower than Step 2.

# Step 3 (optional): TMB validation
chk_tmb <- bias_precision(nReps = 100, testModel = countModel_GLMM,
                          testData = countData, method = "holdout",
                          DHARMaPlot = FALSE, bias_adjust = "tmb", seed = 123)
chk_tmb$bias_precision_summary

Once Jensen’s inequality has been diagnosed as the source of negative RBIAS, the correction can be applied directly to response-scale predictions outside of GLAMMGoF. Note that although the calculations are explicitly written out below, GLAMMGoF has a jensen_correct() which provides a convenient wrapper for this calculation, accepting predictions, standard errors, and confidence interval bounds on the link or response scale (see ?jensen_correct for details):

# Manual lognormal correction (fast, reliable for random intercept models)
vc         <- VarCorr(countModel_GLMM)$cond
correction <- exp(sum(sapply(vc, function(x) x[1, 1])) / 2)
preds_adjusted <- predict(countModel_GLMM, newdata = countData,
                          type = "response", re.form = ~0) * correction

# Standard errors for manually corrected predictions should also be scaled
# by the same correction factor, since it is a multiplicative constant.
# The delta method gives: SE_corrected = SE_original * correction
preds_se   <- predict(countModel_GLMM, newdata = countData,
                      type = "response", re.form = ~0, se.fit = TRUE)$se.fit
preds_se_adjusted <- preds_se * correction

# TMB correction (handles random slopes and complex RE structures).
# Unlike the manual correction, TMB automatically returns bias-corrected
# standard errors alongside point estimates via automatic differentiation --
# no additional scaling is needed.
preds_tmb  <- predict(countModel_GLMM, newdata = countData,
                      type = "response", re.form = ~0, do.bias.correct = TRUE)
# preds_tmb is a matrix with columns:
#   "Estimate"            -- uncorrected marginal prediction
#   "Std. Error"          -- SE of uncorrected prediction
#   "Est. (bias.correct)" -- bias-corrected prediction
#   "Std. (bias.correct)" -- SE of bias-corrected prediction (use this one)
preds_corrected    <- preds_tmb[, "Est. (bias.correct)"]
preds_se_corrected <- preds_tmb[, "Std. (bias.correct)"]

Note that bias_adjust = "manual" is not supported for binomial models in brier_auc() since the analytical log-normal correction is only valid for log-link models. Use bias_adjust = "tmb" instead for binomial GLMMs.

Marginal vs conditional predictions: the conditional_predictions argument

By default, bias_precision() and brier_auc() use population-level (marginal) predictions for both in-sample and out-of-sample performance, setting random effects to zero (re.form = ~0). This assesses how well the model generalizes to new observations where group-level context is unavailable - the most relevant question when predicting to new sites, new years, or new individuals.

Setting conditional_predictions = TRUE switches both in-sample and out-of-sample predictions to condition on estimated random effects (re.form = NULL). Since GLAMMGoF’s holdout CV splits rows within groups rather than groups themselves, random effect estimates from the training model are valid for held-out observations from the same groups. In-sample and out-of-sample metrics remain directly comparable since both use the same prediction strategy; the gap between them reflects genuine row-level over-fitting rather than any marginal vs conditional distinction.

# Conditional predictions for within-group accuracy assessment
chk_cond <- bias_precision(nReps = 100, testModel = countModel_GLMM,
                           testData = countData, method = "holdout",
                           DHARMaPlot = FALSE, conditional_predictions = TRUE,
                           seed = 123)
chk_cond$bias_precision_summary

Summary of bias_adjust and conditional_predictions combinations

The table below summarizes the behavior of each combination of bias_adjust and conditional_predictions in bias_precision(). Note that bias_adjust = "tmb" uses re.form = NULL as a computational necessity so that TMB’s automatic differentiation can access the RE variance structure; this is not a choice to make group-specific conditional predictions, and the output is still a corrected population-level marginal mean. This means conditional_predictions has no additional effect when bias_adjust = "tmb"; both settings produce the same TMB-corrected behavior. There are five distinct behaviors in total; the combination marked ❌ is blocked with an informative error.

bias_adjust conditional_predictions Prediction type What it measures
"none" (default) FALSE (default) Marginal Population-level (i.e., ignore REs) generalization; Jensen signal visible in RBIAS
"manual" FALSE Marginal + corrected Population-level (i.e., ignore REs) generalization with analytical Jensen correction
"tmb" FALSE or TRUE Corrected population-level Corrected population-level marginal mean via TMB’s AD; uses re.form = NULL internally as a computational requirement, not to obtain group-specific predictions
"none" TRUE Conditional Within-group (i.e., predict to known RE groupings) accuracy; Jensen bias absorbed by REs
"manual" TRUE ❌ Error Double-correction - not permitted

For brier_auc(), the same logic applies with the exception that bias_adjust = "manual" is not available for binomial models regardless of conditional_predictions.

The default combination (bias_adjust = "none", conditional_predictions = FALSE) is recommended as the starting point for all analyses. It provides the most diagnostic information and surfaces Jensen’s inequality bias when present. The other combinations are best used in sequence as part of the step-wise diagnostic workflow described above. Note that there is currently no option to obtain truly marginal, TMB-corrected predictions, since the TMB correction is mathematically tied to conditional prediction.

Error messages when using bias_precision()

For obvious reasons, the code below will cause bias_precision quit and throw an error message:

bias_precision(nReps = 100, testModel = logitModel_GLMM, testData = logitData, propTrain = 0.8, DHARMaPlot = TRUE, testZI = TRUE, DHARMaReps = 1000, seed = 123, method = "holdout")
Error in bias_precision(nReps = 100, testModel = logitModel_GLMM, testData = logitData,  : 
  Response variable is binary! Use brier_auc() instead

Error messages when using brier_auc()

For equally obvious reasons, the code below will cause brier_auc quit and throw an error message:

brier_auc(nReps = 100, testModel = countModel_GLMM, testData = countData, propTrain = 0.8, DHARMaPlot = TRUE, DHARMaReps = 1000, seed = 123, method = "holdout")
Error in brier_auc(nReps = 100, testModel = countModel_GLMM, testData = countData,  : 
  Response variable is not binary! Use bias_precision() instead