I'm trying to write a function in R that calculates a central prediction and upper and lower prediction intervals from a trained caret model (i.e., a "train" object) using the 0.632+ Bootstrap approach.
In this effort, I'm attempting to follow a Python example (https://www.saattrupdan.com/posts/2020-03-01-bootstrap-prediction) as a guide. However, I'm having trouble replicating it in R. Any guidance would be appreciated.
My function is supposed to take a trained caret model, the training data, and new data as input and return prediction intervals. However, at present, my prediction interval values are not correct.
As highlighted in a comment by Mark Rieke, one issue is that the entire 0.632+ procedure needs to be done for every bootstrap split, but my current code fails to do this.
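To illustrate what I think that means, here's a minimal, self-contained sketch of the per-split structure (using lm() in place of the caret model; the choice of absolute loss and the clamping note are my assumptions):

# Sketch: every 0.632+ quantity (no-information error, overfitting
# rate, weight) is computed inside the loop, once per bootstrap split
set.seed(1)
n <- 100
x <- runif(n)
y <- 1 + 0.2 * x + rnorm(n)
weights <- numeric(20)
for (b in 1:20) {
  idx <- sample(n, replace = TRUE)
  oob <- setdiff(1:n, idx)
  fit_b <- lm(y ~ x, data = data.frame(x = x[idx], y = y[idx]))
  resid_train <- y[idx] - predict(fit_b)
  resid_oob <- y[oob] - predict(fit_b, newdata = data.frame(x = x[oob]))
  err_train <- mean(abs(resid_train))
  err_oob <- mean(abs(resid_oob))
  # No-information error: mean loss over all outcome/prediction pairs
  err_ni <- mean(abs(outer(y[idx], predict(fit_b), "-")))
  R_b <- (err_oob - err_train) / (err_ni - err_train)
  # (in practice R_b would be clamped to [0, 1]; omitted for brevity)
  weights[b] <- 0.632 / (1 - 0.368 * R_b)
}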
Here's my current code:
library(caret)
# Set the random seed for reproducibility
set.seed(123)
# Generate data
n <- 100
explainer <- runif(n)
y <- 1 + 0.2 * explainer + rnorm(n)
data <- data.frame(explainer, y)
# Fit linear regression models
fit_simple <- lm(y ~ explainer) # A plain old linear model
fit_caret <- train(
  y = y,
  x = data.frame(explainer),
  method = "lm"
) # An identical model, but fit using caret
new_data <- data.frame(explainer = runif(15, min = -10, max = 10))
# Function to calculate prediction intervals using 0.632+ Bootstrap
calculate_prediction_intervals <- function(model, new_data, alpha = 0.05) {
  # Extract training data and outcomes from the model
  X_train <- base::subset(model$trainingData, select = -c(.outcome))
  y_train <- as.numeric(model$trainingData$.outcome)
  n <- nrow(X_train)
  nbootstraps <- as.integer(sqrt(n))
  # Initialize matrices to store bootstrap predictions and validation residuals
  bootstrap_preds <- matrix(0, nrow(new_data), nbootstraps)
  val_residuals <- matrix(0, n, nbootstraps)
  for (b in 1:nbootstraps) {
    train_idxs <- sample(1:n, n, replace = TRUE)
    val_idxs <- setdiff(1:n, train_idxs)
    # Fit a bootstrap sample of the model
    fit_b <- train(
      y = y_train[train_idxs],
      x = X_train[train_idxs, , drop = FALSE],
      method = model$method,
      tuneGrid = model$bestTune,
      trControl = trainControl(method = "none", savePredictions = FALSE)
    )
    # Compute validation set predictions and residuals
    preds_val <- predict(fit_b, newdata = X_train[val_idxs, , drop = FALSE])
    val_residuals[val_idxs, b] <- y_train[val_idxs] - preds_val
    # Compute bootstrap predictions on new data
    preds_new <- predict(fit_b, newdata = new_data)
    bootstrap_preds[, b] <- preds_new
  }
  # Center the bootstrap predictions and residuals
  bootstrap_preds <- bootstrap_preds - colMeans(bootstrap_preds)
  val_residuals <- val_residuals - colMeans(val_residuals)
  # Fit the original model to the full training data
  fit <- train(
    y = y_train,
    x = X_train,
    method = model$method,
    tuneGrid = model$bestTune,
    trControl = trainControl(method = "none", savePredictions = FALSE)
  )
  preds <- predict(fit, newdata = X_train)
  train_residuals <- y_train - preds
  # Calculate various values needed for 0.632+ Bootstrap
  no_information_error <- mean(abs(sample(y_train) - sample(preds)))
  generalization <- abs(colMeans(val_residuals) - mean(train_residuals))
  no_information_val <- abs(no_information_error - train_residuals)
  relative_overfitting_rate <- mean(generalization / no_information_val)
  weight <- 0.632 / (1 - 0.368 * relative_overfitting_rate)
  # Calculate prediction residuals
  residuals <- (1 - weight) * train_residuals + weight * colMeans(val_residuals)
  # Calculate prediction percentiles
  percentiles <- apply(bootstrap_preds, 1, function(x) {
    quantile(x + residuals, probs = c(alpha / 2, 1 - alpha / 2))
  })
  # Create a data frame with predictions, lower, and upper limits
  result <- data.frame(
    fit = predict(fit, newdata = new_data),
    lwr = percentiles[1, ],
    upr = percentiles[2, ]
  )
  return(result)
}
My code fails to even approximately reproduce the expected prediction intervals for a linear model, and increasing the number of bootstrap resamples doesn't help. Can you help me find where I went wrong?
> calculate_prediction_intervals(fit_caret, new_data)
fit lwr upr
1 1.18302967 -0.2597420 1.1699486
2 2.07894173 -1.4669930 7.0949444
3 0.71611677 -2.1804343 0.4431974
4 1.37767478 -0.6438284 2.5235400
5 1.68312227 -0.9393278 4.4294951
6 1.71845385 -1.0413210 4.8058089
7 0.06639059 -6.7192473 1.1929259
8 0.58836348 -3.2036975 0.7598031
9 1.55414870 -0.7131324 3.5583779
10 0.04536204 -6.8536552 1.2401264
11 1.76387322 -1.0177667 5.0307556
12 -0.01836307 -7.4146538 1.4246235
13 1.29583653 -0.4646119 2.0345750
14 0.18768121 -5.8312821 1.0571434
15 1.33552830 -0.4831878 2.0921489
> predict(fit_simple, newdata = new_data, interval= "prediction")
fit lwr upr
1 1.18302967 -0.9262779 3.292337
2 2.07894173 -4.5686088 8.726492
3 0.71611677 -2.0877607 3.519994
4 1.37767478 -1.4345098 4.189859
5 1.68312227 -2.6904110 6.056656
6 1.71845385 -2.8512314 6.288139
7 0.06639059 -6.2672902 6.400071
8 0.58836348 -2.8285939 4.005321
9 1.55414870 -2.1238365 5.232134
10 0.04536204 -6.4117391 6.502463
11 1.76387322 -3.0606644 6.588411
12 -0.01836307 -6.8508475 6.814121
13 1.29583653 -1.1747848 3.766458
14 0.18768121 -5.4394392 5.814802
15 1.33552830 -1.2942424 3.965299
I am aware that alternatives to the method I am trying to replicate exist, e.g., conformal inference, or simply adding raw residuals to bootstrap predictions (sketched below), but I'm after this specific approach. It should generally replicate the methods of https://arxiv.org/abs/2201.11676, similar to other approaches that have used tidymodels, e.g., https://www.bryanshalloway.com/2021/04/05/simulating-prediction-intervals/ and the workboots package (https://markjrieke.github.io/workboots/).
I plan to use this function on more complicated caret models (e.g., models with many predictors, not just linear models) trained with x and y specified directly; I'm not using caret's formula interface. Because of that, approaches that only work for linear models won't do the trick, either.
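For clarity, by "simply adding raw residuals" I mean something like the following sketch, included only to show what I'm not after (the function name and defaults here are made up):

# Naive alternative: add resampled raw training residuals to bootstrap
# predictions and take quantiles -- no 0.632+ correction at all
naive_intervals <- function(fit, data, new_data, n_boot = 500, alpha = 0.05) {
  preds <- replicate(n_boot, {
    idx <- sample(nrow(data), replace = TRUE)
    fit_b <- lm(formula(fit), data = data[idx, ])
    predict(fit_b, newdata = new_data) +
      sample(residuals(fit_b), nrow(new_data), replace = TRUE)
  })
  t(apply(preds, 1, quantile, probs = c(alpha / 2, 1 - alpha / 2)))
}
naive_intervals(fit_simple, data, new_data)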
Following the approach from the workboots package, with only a few adjustments to work with caret objects, the function below returns all of the bootstrapped predictions (with the corrected residuals added), the prediction quantiles for a given alpha, and the central fit on the new data.
Note: this differs slightly from the original Python effort in formulation, though it's the same in effect. Rather than pooling centered residuals across splits, each bootstrap draw adds Gaussian noise whose standard deviation blends the training and OOB residual standard deviations via the 0.632+ weight.
# Function to generate prediction intervals for a caret model using bootstrapping
predict_caret_boots <-
  function(model,
           n = 2000,
           alpha = 0.05,
           new_data) {
    # Extract training data and outcomes from the model
    X_train <- base::subset(model$trainingData, select = -c(.outcome))
    y_train <- as.numeric(model$trainingData$.outcome)
    # Initialize a list to store predictions
    preds_list <- list()
    # Loop through n bootstrap resamples
    for (i in 1:n) {
      # Create a bootstrap sample
      train_idxs <- sample(length(y_train), replace = TRUE)
      boot_X_train <- X_train[train_idxs, , drop = FALSE]
      boot_y_train <- y_train[train_idxs]
      boot_X_oob <- X_train[-train_idxs, , drop = FALSE]
      boot_y_oob <- y_train[-train_idxs]
      # Fit a model on the bootstrap sample
      fit_b <- train(
        y = boot_y_train,
        x = boot_X_train,
        method = model$method,
        tuneGrid = model$bestTune,
        trControl = trainControl(method = "none", savePredictions = FALSE)
      )
      # Make predictions on the new data
      preds <- predict(fit_b, newdata = new_data)
      # Make predictions on training data
      preds_train <- predict(fit_b, newdata = boot_X_train)
      # Make predictions on OOB data
      preds_oob <- predict(fit_b, newdata = boot_X_oob)
      # Calculate centered training residuals
      resids_train <- boot_y_train - preds_train
      resids_train <- resids_train - mean(resids_train)
      # Calculate centered OOB residuals
      resids_oob <- boot_y_oob - preds_oob
      resids_oob <- resids_oob - mean(resids_oob)
      # Calculate no-information error rate (rmse_ni) with RMSE as the loss
      # function, evaluated over all outcome/prediction pairs
      combos <- tidyr::crossing(boot_y_train, preds_train)
      rmse_ni <- caret::RMSE(combos$preds_train, combos$boot_y_train)
      # Calculate overfit rate
      rmse_oob <- caret::RMSE(boot_y_oob, preds_oob)
      rmse_train <- caret::RMSE(boot_y_train, preds_train)
      overfit <- (rmse_oob - rmse_train) / (rmse_ni - rmse_train)
      # Calculate weight (if overfit = 0, the OOB residuals simply get
      # weight .632). Use the actual proportion of OOB samples in this
      # resample rather than the 0.632/0.368 average
      prop_368 <- length(boot_y_oob) / length(boot_y_train)
      prop_632 <- 1 - prop_368
      weight <- prop_632 / (1 - (prop_368 * overfit))
      # Determine residual std. dev. based on weight
      sd_oob <- stats::sd(resids_oob)
      sd_train <- stats::sd(resids_train)
      sd_resid <- weight * sd_oob + (1 - weight) * sd_train
      # Add random residual noise to predictions
      preds <- preds + stats::rnorm(length(preds), 0, sd_resid)
      # Create a data frame with predictions and add it to the list
      preds_df <- data.frame(fit = preds)
      preds_list[[i]] <- preds_df
    }
    # Calculate quantiles for each row of preds_list
    preds_list <- data.frame(preds_list)
    quantiles <-
      apply(preds_list, 1, function(row)
        quantile(row, probs = c(alpha / 2, 1 - alpha / 2)))
    # Get the central fit, too
    fit_new <- predict(model, new_data)
    result <- list(
      preds = data.frame(preds_list),
      quantiles = t(data.frame(quantiles)),
      fit = data.frame(fit_new)
    )
    return(result)
  }
A little adjustment to this function could help it explicitly handle preprocessing options from caret, etc. But for now, this appears to do the trick beautifully!
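For example, here's a usage sketch reusing fit_caret and new_data from the question (the component names follow the list returned above):

set.seed(123)
res <- predict_caret_boots(fit_caret, n = 2000, alpha = 0.05, new_data = new_data)
# Assemble fit/lwr/upr to compare against lm's parametric intervals
intervals <- data.frame(
  fit = res$fit$fit_new,
  lwr = res$quantiles[, 1],
  upr = res$quantiles[, 2]
)
head(intervals)
head(predict(fit_simple, newdata = new_data, interval = "prediction"))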