---
title: "GLAMMGoF Tutorial"
output: 
  rmarkdown::html_document:
    toc: true
    code_overflow: wrap
    number_sections: false
vignette: >
  %\VignetteIndexEntry{GLAMMGoF Tutorial}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  echo = T,
  warning = F,
  comment = NA,
  message = F,
  fig.width = 7, 
  fig.height = 4
)
```

#### What does GLAMMGoF do?

`GLAMMGoF` is an R package for resampling-based predictive assessment of generalized linear and generalized additive models. The two main functions in the package, `bias_precision()` and `brier_auc()`, quantify in-sample and out-of-sample predictive performance by repeatedly refitting a model structure to resampled training data and evaluating predictions on held-out testing data. These functions support model objects from a variety of R packages, including `glmmTMB()` from `glmmTMB`, `lmer()`, `glmer()`, and `glmer.nb()` from `lme4`, `glm.nb()` from `MASS`, `gam()` from `mgcv`, and `glm()` and `lm()` from the stats package. Optional residual diagnostics can also be performed using the `DHARMa` package.

-   `bias_precision()` is intended for models with continuous or integer response variables (e.g., Gaussian, Poisson, negative binomial, Tweedie, or Gamma distributions) and returns relative root mean squared error (RRMSE), relative mean absolute error (RMAE), relative median absolute error (RMedAE), and relative bias (RBIAS).

-   `brier_auc()` is intended for models with binary response variables (i.e., logistic regression) and returns AUC, Brier score, and log loss.

Both functions accept models with or without random effects, but predictive performance is based on population-level predictions, meaning random effects are excluded during prediction.

Two resampling methods are available via the method argument: repeated random holdout (method = "holdout", the default), which repeatedly partitions the data into training and testing subsets and is intuitive and widely used; and bootstrap resampling with out-of-bag evaluation (method = "bootstrap"), which samples with replacement and tends to make more efficient use of the available data, particularly for smaller datasets.

**Important note:** The functions in this package take an existing model fitted to the full dataset and assess the predictive performance of that model structure through repeated resampling. In each replicate, the model is refit to a resampled training dataset and evaluated on separate testing data. This workflow is intentional: fitting the final model to all available data maximizes information available for inference, while resampling-based predictive assessment provides a practical way to evaluate generalization without permanently withholding data. This approach is especially useful in ecological applications where sample sizes are often limited.

In general, `method = "holdout"` more closely reflects prediction to independent unseen observations and may be more intuitive for assessing generalization, whereas `method = "bootstrap"` makes more efficient use of the available data by evaluating performance on out-of-bag observations and may provide more stable estimates of predictive performance when sample sizes are small.

#### Getting started

You've either downloaded and installed `GLAMMGoF` from R-universe, GitHub, DevOps, or a local zipped package file. Once installed, you can access the vignette by running `browseVignettes("GLAMMGoF")`. Note that `ggplot2` is a dependency, and functions from numerous other R packages are imported and used as well.

```{r, warning=FALSE, message = FALSE, comment=NA}
library(GLAMMGoF)
```

```{r, warning=FALSE, message = FALSE, comment=NA, echo=FALSE}
library(dplyr)
library(tidyr)
```

#### How do I use GLAMMGoF?

There are only two functions in this package, and which one you use depends on the type of model being assessed:

`bias_precision()`: used for models with continuous or integer response variables and estimates four predictive performance metrics using repeated resampling: `RRMSE`, `RMAE`, `RMedAE`, and `RBIAS`. For all four metrics, values closer to 0 indicate better predictive performance (see `?bias_precision` for details). All three accuracy statistics (`RRMSE`, `RMAE`, and `RMedAE`) express prediction error as a percentage of the mean observed value, which places them on a common scale regardless of the units or magnitude of the response. To recover the corresponding raw error in the original units of the response, multiply any of these values by the observed mean and divide by 100. For example, an `RRMSE` of 18% with a sample mean of 50 implies a root mean squared error of 9 in the original units.

-   `RRMSE (Relative Root Mean Squared Error)`: The square root of average squared prediction errors, expressed as a percentage of the mean observed value. Because errors are squared before averaging, RRMSE penalizes large individual errors more heavily than small ones, making it sensitive to cases where the model produces occasional extreme predictions.

    ```         
    fit_cost_rrmse <- function(y, yhat) sqrt(mean((yhat - y)^2)) / mean(y) * 100
    ```

-   `RMAE (Relative Mean Absolute Error)`: The average absolute prediction error expressed as a percentage of the mean observed value. `RMAE` treats all errors proportionally to their size without the extra weight that squaring applies, and is often the most intuitive summary of typical prediction accuracy.

    ```         
    fit_cost_rmae  <- function(y, yhat) mean(abs(yhat - y)) / mean(y) * 100
    ```

-   `RMedAE (Relative Median Absolute Error)`: The median absolute prediction error expressed as a percentage of the mean observed value. Because it is based on the median rather than the mean of the error distribution, `RMedAE` is the most robust of the three to outlying predictions and gives the best picture of accuracy for a typical observation when the error distribution is skewed.

    ```         
    fit_cost_rmedae  <- function(y, yhat) median(abs(yhat - y)) / mean(y) * 100
    ```

    -   Taken together, these three statistics tell a more complete story than any one alone. If `RRMSE` is notably larger than `RMAE`, that signals a handful of high-leverage mispredictions are inflating the squared-error average, which is a pattern `RMedAE` will often fail to reflect if the bulk of predictions are accurate. Conversely, close agreement among all three suggests errors are roughly symmetric and there are no extreme outliers driving the summary.

-   `RBIAS (Relative Bias)`: The mean signed prediction error expressed as a percentage of the mean observed value, where positive values indicate systematic over-prediction and negative values indicate systematic under-prediction. `RBIAS` is independent of the accuracy metrics above: a model can be nearly unbiased on average yet still produce large errors, or it can be highly biased while still ranking observations correctly. Reporting `RBIAS` alongside `RRMSE` and `RMAE` therefore distinguishes random prediction noise from systematic directional error.

    ```         
    fit_cost_rbias <- function(y, yhat) mean(yhat - y) / mean(y) * 100
    ```

`brier_auc()`: used for models with binary response variables and estimates three predictive performance metrics using repeated resampling: `AUC`, `Brier score`, and `log loss`. `AUC` values closer to 1 (bounded between 0 and 1) indicate better classification performance, whereas `Brier score` (bounded between 0 and 1) and `log loss` (lower bound of 0, no upper bound) values closer to 0 indicate better probabilistic predictions (see `?brier_auc` for details).

-   The three binary metrics each capture a different aspect of predictive performance and are most informative when interpreted together.

-   `AUC (Area Under the ROC Curve)`: The probability that a randomly chosen positive case receives a higher predicted probability than a randomly chosen negative case, ranging from 0 (no discrimination) to 1 (perfect discrimination), and where an `AUC` of 0.5 indicates that a model predicts no better than random chance (and `AUC` \<0.5 is worse than random). `AUC` is purely a measure of discrimination (the model's ability to rank cases by their probability of the outcome) and is insensitive to how well the predicted probabilities are calibrated in absolute terms. This function also returns the null model (i.e., intercept-only) `AUC` statistic, which is assumed to be 0.5, which allows for comparison to the summarized `AUC` metrics.

-   `Brier Score`: Across all observations, the mean squared difference between predicted probabilities and observed binary outcomes (0/1), ranging from 0 (perfect) to 1 (worst). The `Brier score` jointly rewards good discrimination and good calibration, penalizing confident but wrong predictions more heavily than uncertain ones. A model can have high `AUC` but a poor `Brier score` if its predicted probabilities are systematically too high or too low even while correctly ordering cases. This function also returns the null model (i.e., intercept-only) `Brier score` for reference (i.e., the best predictive performance achievable without any predictors), computed as `p_bar * (1 - p_bar)`, where `p_bar` is the observed event rate, which allows for comparison to the summarized Brier score metrics. Note that the null Brier score is based on the arithmetic mean of the response, which approximates but is not identical to plogis(intercept) from a fitted intercept-only logistic model; the difference is negligible in practice and avoids the overhead of model fitting.

-   `Log Loss (Binary Cross-Entropy)`: Across all observations, the mean negative log-likelihood of the observed outcomes under the predicted probabilities, where lower values indicate better performance. Like the `Brier score`, `log loss` rewards calibration, but its logarithmic scale means it penalizes extreme, confidently wrong predictions far more aggressively. `Log loss` is therefore the most sensitive of the three to cases where the model assigns near-zero probability to an outcome that actually occurs. This function also returns the null model (i.e., intercept-only) log loss for reference (i.e., the best predictive performance achievable without any predictors), computed as `-(p_bar * log(p_bar) + (1 - p_bar) * log(1 - p_bar))`, where `p_bar` is the observed event rate, which allows for comparison to the summarized log loss metrics. Note that the null log loss is based on the arithmetic mean of the response, which approximates but is not identical to plogis(intercept) from a fitted intercept-only logistic model; the difference is negligible in practice and avoids the overhead of model fitting. Log loss is computed internally as:

```         
    log_loss <- function(y, p, eps = 1e-15) {p <- pmax(pmin(p, 1 - eps), eps) -mean(y * log(p) + (1 - y) * log(1 - p))}
```

-   Reporting all three together is recommended: `AUC` reflects whether the model correctly separates cases from non-cases; `Brier score` provides an overall summary of probability accuracy on a squared-error scale; and `log loss` highlights whether the model is making any dangerously overconfident errors that the `Brier score` might underweight.

Both functions assess in- and out-of-sample predictive performance for generalized linear and generalized additive models, with or without random effects, by repeatedly resampling data and evaluating predictive performance across multiple replicates. Results are stored across all replicates and reported as a summary table (mean and 95% confidence intervals) and a histogram of replicate values for each performance statistic, with a blue dotted vertical line indicating the mean across replicates. For `brier_auc()`, a red dashed vertical line is also displayed in the log loss panels, indicating the null (intercept-only) log loss baseline as a reference point for model comparison.

Two resampling methods are available via the method argument. The default (`method = "holdout"`) uses repeated random holdout (Monte Carlo cross-validation), which splits the data into a training set (80% by default, controlled by `propTrain`) and a test set (the remaining 20%). The model is refit to the training data and performance is evaluated on both the training data (in-sample performance) and the withheld test data (out-of-sample performance). This method is intuitive and widely understood, but the fixed training/test split means the training set is always smaller than the full dataset, which can be a limitation for smaller datasets.

Alternatively, `method = "bootstrap"` uses bootstrap resampling with replacement, where each bootstrap training dataset contains the same number of observations as the original dataset, sampled with replacement. Out-of-sample performance is evaluated on the out-of-bag observations not selected in the bootstrap sample (approximately 36.8% of observations on average). This method makes more efficient use of the available data and tends to produce more stable estimates, particularly for smaller datasets, but is slightly less intuitive than the holdout approach.

For well-behaved models and reasonably sized datasets, both methods should produce similar results. Differences are most likely to emerge with small datasets, highly overdispersed data, or poorly specified models.

For zero-heavy count or continuous data, the choice of performance metric matters. `RBIAS` in particular can behave unexpectedly when the response contains zeros or values near zero, because the relative error formulation divides by the observed value, producing unstable or extreme terms that can dominate the average and give a misleading impression of systematic underprediction. This is a known limitation of relative bias metrics rather than evidence of poor model fit (Hyndman and Koehler 2006), and is most likely to occur when zeros are common but the zero-generating process is not explicitly modeled.

-   In these situations, `RMedAE` or `RMAE` may provide more reliable summaries of predictive performance. `RMedAE` is robust to extreme errors because it is based on the median absolute error rather than the mean, and therefore better reflects predictive accuracy for a typical observation. `RMAE` can also be useful because absolute-error metrics weight errors linearly rather than quadratically, making them less sensitive to extreme observations than `RMSE`-based measures.

-   When combined with the `DHARMa` zero-inflation test (`testZI = TRUE`), these metrics provide a more complete assessment than any single measure alone. `RMedAE` or `RMAE` summarize predictive accuracy, while the ZI test evaluates whether the model correctly reproduces the observed frequency of zeros. A significant ZI test result with `ratioObsSim > 1` indicates that the model is underpredicting zeros, suggesting that a zero-inflated model family or additional sources of heterogeneity should be considered.

This entire process is repeated `nReps` times, and the resulting sampling distributions of the performance metrics are summarized. Because `RRMSE`, `RMAE`, `RMedAE`, `RBIAS`, `AUC`, `Brier score`, and `log loss` quantify different aspects of predictive performance, they should be interpreted as complementary rather than interchangeable

There are two example datasets, `countData` and `logitData`, with integer and binary response variables, respectively. Each dataset also includes a variety of continuous and categorical predictor variables. Additionally, there are 12 example model objects (6 beginning with `count` and 6 beginning with `logit`): `Model_GLM`, `Model_GLMM`, `Model_GLMM2`, `Model_GAM`, `Model_GAMM`, and `Model_GAMM2`. Models ending in `M` include only fixed effects, whereas models ending in `MM` and `MM2` include one and two random effects, respectively. All 12 example models were fitted using `glmmTMB`, but `GLAMMGoF` supports models fitted using a variety of compatible `R` packages

#### Using the `bias_precision()` function

To use this function properly, you must:

1.  Be sure you're using it with the correct model type, which in this case is one with a continuous or integer response variable. Here, a variety of distributions (e.g., Gaussian, Poisson, negative binomial, Tweedie, and gamma) are supported, as are numerous R packages (`glmmTMB`, `lme4` (including `glmer`, `glmer.nb`, and `lmer`), `MASS` (`glm.nb`), `mgcv` (`gam`), and `glm`/`lm` from the `stats` package).

2.  The only other choices you need to make are:

    1.  The number of bootstrap or Monte Carlo replicates you want to use (the default is `nReps=100` but this should be at least 500 or 1000).

    2.  For `method = "holdout"` only: The proportion of data you want to use for assessing in-sample predictive performance (the default is 0.8).

    3.  Whether you want to use the `DHARMa` package's `simulateResiduals()` function to assess and plot the model's residuals (the default is `DHARMaPlot=TRUE`), whether you also want `DHARMa's` to test for zero inflation using the `testZeroInflation()` function (the default is `testZI=TRUE`), and how many simulation replicates you want `DHARMa` to use (the default is `DHARMaReps=1000`).

Now let's run it and see what happens...

```{r, warning=FALSE, comment=NA}
bp <- bias_precision(nReps = 100, testModel = countModel_GLMM, testData = countData, propTrain = 0.8, DHARMaPlot = TRUE, testZI = TRUE, DHARMaReps = 1000, seed = 123, method = "holdout")
```

```{r, warning=FALSE, comment=NA, echo=FALSE}
bp$bias_precision_results <- bp$bias_precision_results %>% mutate(across(value, ~ round(., 2)))
```

Look at the first few fit statistics...

```{r, warning=FALSE, comment=NA, echo=FALSE}
knitr::kable(head(bp$bias_precision_results, 8))
```

 

And then look at a summary of the `nReps` values...

```{r, warning=FALSE, comment=NA, echo=FALSE}
bp$bias_precision_summary <- bp$bias_precision_summary %>% mutate(across(c(mn, lwr95, upr95), ~ round(., 2)))
```

```{r, warning=FALSE, comment=NA, echo=FALSE}
knitr::kable(head(bp$bias_precision_summary, 8))
```

 

And now look at a plot of the summaries

```{r, warning=FALSE, comment=NA, message=FALSE, echo=FALSE}
bp$bias_precision_hist
```

 

Now let's compare to the `bootstrap` method, which generally should be very similar for well-behaved models and reasonably sized data sets.

```{r, warning=FALSE, comment=NA}
bp <- bias_precision(nReps = 100, testModel = countModel_GLMM, testData = countData, propTrain = 0.8, DHARMaPlot = TRUE, testZI = TRUE, DHARMaReps = 1000, seed = 123, method = "bootstrap")
```

Summary of `bootstrap` method

```{r, warning=FALSE, comment=NA, echo=FALSE}
bp$bias_precision_summary <- bp$bias_precision_summary %>% mutate(across(c(mn, lwr95, upr95), ~ round(., 2)))
```

```{r, warning=FALSE, comment=NA, echo=FALSE}
knitr::kable(head(bp$bias_precision_summary, 8))
```

And now look at a plot of the summaries

```{r, warning=FALSE, comment=NA, message=FALSE, echo=FALSE}
bp$bias_precision_hist
```

#### Using the `brier_auc()` function

To use this function properly, you must:

1.  Be sure you're using it with the correct model type, which in this case is only a logistic regression with a binary response variable. Additionally, logistic regression models fit using the R packages `glmmTMB`, `lme4`, and `glm`/`lm` from the `stats` package are all supported.

2.  The only other choices you need to make are:

    1.  The number of bootstrap or Monte Carlo replicates you want to use (the default `nReps=100` but this should be at least 500 or 1000).

    2.  For `method = "holdout"` only: The proportion of data you want to use for assessing in-sample predictive performance (the default is `propTrain= 0.8`).

    3.  Whether you want to use the `DHARMa` package's `simulateResiduals()` function to assess and plot the model's residuals (the default is `DHARMaPlot=TRUE`) and how many simulation replicates you want `DHARMa` to use (the default is `DHARMaReps=1000`). There is no option to test for zero-inflation with this function as this is not strictly relevant to logistic regression models with binary responses.

```{r, warning=FALSE, comment=NA}
br_auc <- brier_auc(nReps = 100, testModel = logitModel_GLMM, testData = logitData, propTrain = 0.8, DHARMaPlot = TRUE, DHARMaReps = 1000, seed = 123, method = "holdout")
```

```{r, warning=FALSE, comment=NA, echo=FALSE}
br_auc$brier_auc_results <- br_auc$brier_auc_results %>% mutate(across(value, ~ round(., 2)))
```

Look at the first few fit statistics...

```{r, warning=FALSE, comment=NA, echo=FALSE}
knitr::kable(head(br_auc$brier_auc_results, 9))
```

 

```{r, warning=FALSE, comment=NA, echo=FALSE}
br_auc$brier_auc_summary <- br_auc$brier_auc_summary %>% mutate(across(c(mn, lwr95, upr95), ~ round(., 2)))
```

 

And then look at a summary of the `nReps` values...

```{r, warning=FALSE, comment=NA, echo=FALSE}
knitr::kable(head(br_auc$brier_auc_summary, 9))
```

 

And now look at a plot of the summaries

```{r, warning=FALSE, comment=NA, message=FALSE, echo=FALSE}
br_auc$brier_auc_hist
```

 

Let's again compare to the `bootstrap` method, which generally should be very similar for well-behaved models and reasonably sized data sets.

```{r, warning=FALSE, comment=NA}
br_auc <- brier_auc(nReps = 100, testModel = logitModel_GLMM, testData = logitData, propTrain = 0.8, DHARMaPlot = TRUE, DHARMaReps = 1000, seed = 123, method = "bootstrap")
```

Summary of `bootstrap` method

```{r, warning=FALSE, comment=NA, echo=FALSE}
br_auc$brier_auc_summary <- br_auc$brier_auc_summary %>% mutate(across(c(mn, lwr95, upr95), ~ round(., 2)))
```

```{r, warning=FALSE, comment=NA, echo=FALSE}
knitr::kable(head(br_auc$brier_auc_summary, 9))
```

And now look at a plot of the summaries

```{r, warning=FALSE, comment=NA, message=FALSE, echo=FALSE}
br_auc$brier_auc_hist
```

#### Error messages when using `bias_precision()`

For obvious reasons, the code below will cause `bias_precision` quit and throw an error message:

```{r, warning=FALSE, comment=NA, eval=FALSE}
bias_precision(nReps = 100, testModel = logitModel_GLMM, testData = logitData, propTrain = 0.8, DHARMaPlot = TRUE, testZI = TRUE, DHARMaReps = 1000, seed = 123, method = "holdout")
```

```{r, warning=FALSE, comment=NA, echo=FALSE}
try(bias_precision(nReps = 100, testModel = logitModel_GLMM, testData = logitData, propTrain = 0.8, DHARMaPlot = TRUE, testZI = TRUE, DHARMaReps = 1000, seed = 123, method = "holdout"))
```

#### Error messages when using `brier_auc()`

For equally obvious reasons, the code below will cause `brier_auc` quit and throw an error message:

```{r, warning=FALSE, comment=NA, eval=FALSE}
brier_auc(nReps = 100, testModel = countModel_GLMM, testData = countData, propTrain = 0.8, DHARMaPlot = TRUE, DHARMaReps = 1000, seed = 123, method = "holdout")
```

```{r, warning=FALSE, comment=NA, echo=FALSE}
try(brier_auc(nReps = 100, testModel = countModel_GLMM, testData = countData, propTrain = 0.8, DHARMaPlot = TRUE, DHARMaReps = 1000, seed = 123, method = "holdout"))
```