Reputation: 11
I keep on going around in circles trying to bootstrap confidence intervals for my data. I've only got very rudimentary knowledge about stats and am having trouble adapting the code such as here.
My aim is to be able to predict the mean, confidence intervals, and sd for n values (say, 300) along the x range of the data (ie. from 27.05575 to 144.75700, but can truncate the data if needed for the bootstrapping processes).
Sample code for generating the loess.
# create a data frame
df <- data.frame(
DBH = c(27.05575, 30.10165, 41.36365, 48.31459, 64.64380, 64.88845, 65.55535, 75.12160, 79.40695, 113.27850, 114.68800, 120.68150, 125.24300, 130.27200, 132.17600, 144.75700),
length = c(0.0000000, 0.0000000, 0.0000000, 0.0000000, 1.5056656, 0.4686661, 1.5143648, 1.2282208, 0.3701741, 19.2412440, 51.3086010, 33.4588765, 254.6009090, 35.0538617, 59.5713370, 195.1270735),
normalised = c(0.000000000, 0.000000000, 0.000000000, 0.000000000, 0.005913827, 0.001840787, 0.005947995, 0.004824102, 0.001453939, 0.075574137, 0.201525600, 0.131416956, 1.000000000, 0.137681605, 0.233979278, 0.76640368)
)
model <- loess(normalised ~ DBH, data= df, span = .8)
xrange <- range(subData$DBH)
xseq <- seq(from=xrange[1], to=xrange[2], length=300)
pred <- predict(model, newdata = data.frame(DBH = xseq), se=TRUE)
yfit = pred$fit
predictionDataFrame <- data.frame(xseq, yfit) %>%
rename(DBH = xseq, normalised = yfit)
ggplot(data = predictionDataFrame, aes(x = DBH, y = normalised)) +
geom_line(size = 2) +
geom_point(data = df, aes(x = DBH, y = normalised)) +
theme_bw()
Side note - I'd prefer a less smooth curve, but as there are some gaps in my data, I run into some weirdness when I use a lower smoothing parameter. Ie this is the curve for 0.6:
Besides from the 'span' parameter, are there other ways to control the loes? Changing the other parameters doesn't seem to do much. However, using the loess.boot
function from the spatialEco
package, the fitted curves seem more targeted than just the raw loess
function with 0.8 smoothing. This last image is a comparison of a couple of different measurements of mine using the loess.boot
function from spatialEco
(thick lines) and the loess
function (dashed lines). I'd prefer not to rely on that package and go through the process manually so I understand what's happening.
Upvotes: 0
Views: 431
Reputation: 1349
As commented by Gregor Thomas, you have to put your code for fitting the model and getting prediction in functions. It is then relatively straight forward to use e.g. tidymodels
to apply bootstrap resampling to estimate uncertainty. (Though I give no guarantees that these estimates of uncertainty are statistically sound for whatever inference you will try to use them for.)
Here is an example where I've taken your code for fitting the model and making predictions as verbatim as possible from question and made them into functions, and then used a tidymodels
approach to estimate the model and make predictions on 10k bootstrap samples:
library(dplyr)
library(purrr)
library(tidymodels)
set.seed(2023)
df <- data.frame(
DBH = c(27.05575, 30.10165, 41.36365, 48.31459, 64.64380, 64.88845, 65.55535, 75.12160, 79.40695, 113.27850, 114.68800, 120.68150, 125.24300, 130.27200, 132.17600, 144.75700),
length = c(0.0000000, 0.0000000, 0.0000000, 0.0000000, 1.5056656, 0.4686661, 1.5143648, 1.2282208, 0.3701741, 19.2412440, 51.3086010, 33.4588765, 254.6009090, 35.0538617, 59.5713370, 195.1270735),
normalised = c(0.000000000, 0.000000000, 0.000000000, 0.000000000, 0.005913827, 0.001840787, 0.005947995, 0.004824102, 0.001453939, 0.075574137, 0.201525600, 0.131416956, 1.000000000, 0.137681605, 0.233979278, 0.76640368)
)
fit_loess_on_bootstrap <- function(split) {
loess(normalised ~ DBH, data = analysis(split), span = .8)
}
extract_prediction <- function(model, xrange) {
xseq <- seq(from = xrange[1], to = xrange[2], length = 300)
pred <- predict(model, newdata = data.frame(DBH = xseq), se = TRUE)
tibble(term = xseq, estimate = pred$fit)
}
boots <-
df %>%
bootstraps(10000) %>%
mutate(
model = map(splits, fit_loess_on_bootstrap),
preds = map(model, extract_prediction, xrange = range(df$DBH)),
spline = map(model, augment)
)
## Look at a sample of individual fitted loess curves:
boots %>%
sample_n(100) %>%
unnest(cols = c(spline)) %>%
ggplot(aes(DBH, normalised)) +
geom_line(aes(DBH, .fitted, group = id), alpha = .2) +
geom_point(data = df) +
theme_bw()
## Estimate CI using the percentile method:
results <-
boots %>%
int_pctl(preds)
results %>%
ggplot(aes(term, .estimate, ymin = .lower, ymax = .upper)) +
geom_ribbon(fill = "grey75") +
geom_line() +
labs(x = "DBH", y = "normalised") +
theme_bw()
EDIT 23-04-03: Here is an example on how to extract the SD from each point.
We can extract the predictions for each point from each bootstrap sample by unnesting our preds
column. (The preds
column is just a list of data frames, one for each bootstrap sample -- so to extract the predictions you could use any method of row binding them together, such as boots_preds <- do.call("rbind", boots$preds)
).
If we then group by term
, which denotes the point in the range, and use summarize()
, we can summarize whatever you want at each point in the range --- including the standard deviation, mean, median, and so on. For instance:
## EDIT 23-04-03: Extract mean and sd from the boots:
results_sd <-
boots %>%
unnest(cols = c(preds)) %>%
filter(!is.na(estimate)) %>%
group_by(term) %>%
summarize(
avg = mean(estimate),
med = median(estimate),
std_dev = sd(estimate)
) %>%
ungroup()
print(results_sd)
# A tibble: 300 × 4
term avg med std_dev
<dbl> <dbl> <dbl> <dbl>
1 27.1 0.00131 0.000207 0.00252
2 27.4 0.00116 0.000149 0.00231
3 27.8 0.00101 0.000128 0.00211
4 28.2 0.000872 0.0000926 0.00193
5 28.6 0.000732 0.0000519 0.00176
6 29.0 0.000595 0.0000146 0.00161
7 29.4 0.000460 -0.00000783 0.00148
8 29.8 0.000327 -0.0000393 0.00138
9 30.2 0.000566 -0.0000133 0.00199
10 30.6 0.000423 -0.0000236 0.00187
# … with 290 more rows
Upvotes: 0