Alex

Reputation: 11

How to bootstrap a loess function and estimate confidence intervals in R

I keep going around in circles trying to bootstrap confidence intervals for my data. I have only very rudimentary knowledge of statistics and am having trouble adapting example code such as the one here.

My aim is to be able to predict the mean, confidence intervals, and sd for n values (say, 300) along the x range of the data (i.e. from 27.05575 to 144.75700, though I can truncate the data if needed for the bootstrapping process).

Sample code for generating the loess:

# create a data frame
df <- data.frame(
  DBH = c(27.05575, 30.10165, 41.36365, 48.31459, 64.64380, 64.88845, 65.55535, 75.12160, 79.40695, 113.27850, 114.68800, 120.68150, 125.24300, 130.27200, 132.17600, 144.75700),
  length = c(0.0000000, 0.0000000, 0.0000000, 0.0000000, 1.5056656, 0.4686661, 1.5143648, 1.2282208, 0.3701741, 19.2412440, 51.3086010, 33.4588765, 254.6009090, 35.0538617, 59.5713370, 195.1270735),
  normalised = c(0.000000000, 0.000000000, 0.000000000, 0.000000000, 0.005913827, 0.001840787, 0.005947995, 0.004824102, 0.001453939, 0.075574137, 0.201525600, 0.131416956, 1.000000000, 0.137681605, 0.233979278, 0.76640368)
)

library(dplyr)    # for %>% and rename()
library(ggplot2)

model <- loess(normalised ~ DBH, data = df, span = .8)
xrange <- range(df$DBH)  # fixed: was subData$DBH, which is not defined here
xseq <- seq(from = xrange[1], to = xrange[2], length = 300)
pred <- predict(model, newdata = data.frame(DBH = xseq), se = TRUE)
yfit <- pred$fit

predictionDataFrame <- data.frame(xseq, yfit) %>%
  rename(DBH = xseq, normalised = yfit)

ggplot(data = predictionDataFrame, aes(x = DBH, y = normalised)) +
  geom_line(size = 2) +
  geom_point(data = df, aes(x = DBH, y = normalised)) +
  theme_bw()

loess with 0.8 smoothing

Side note: I'd prefer a less smooth curve, but since there are some gaps in my data, I run into some weirdness when I use a lower smoothing parameter. E.g., this is the curve for 0.6:

loess with 0.6 smoothing

Besides the 'span' parameter, are there other ways to control the loess? Changing the other parameters doesn't seem to do much. However, using the loess.boot function from the spatialEco package, the fitted curves seem more targeted than those from the raw loess function with 0.8 smoothing. This last image compares a couple of different measurements of mine using the loess.boot function from spatialEco (thick lines) and the loess function (dashed lines). I'd prefer not to rely on that package and instead go through the process manually so I understand what's happening.

comparing loess predictions
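For reference, `span` is not the only knob: a minimal sketch of the other real arguments `stats::loess()` exposes, assuming the data frame `df` from above (whether any of these help will depend on the data):

```r
# Locally linear fits instead of the default locally quadratic (degree = 2)
m1 <- loess(normalised ~ DBH, data = df, span = .8, degree = 1)

# Robust fitting: iteratively re-weights points to downweight outliers
m2 <- loess(normalised ~ DBH, data = df, span = .8, family = "symmetric")

# Exact (direct) surface computation rather than interpolation;
# also allows extrapolation outside the data range
m3 <- loess(normalised ~ DBH, data = df, span = .8,
            control = loess.control(surface = "direct"))
```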

Upvotes: 0

Views: 431

Answers (1)

Mikael Poul Johannesson

Reputation: 1349

As commented by Gregor Thomas, you have to put your code for fitting the model and getting predictions into functions. It is then relatively straightforward to use e.g. tidymodels to apply bootstrap resampling to estimate uncertainty. (Though I make no guarantees that these estimates of uncertainty are statistically sound for whatever inference you will use them for.)

Here is an example where I've taken your code for fitting the model and making predictions as close to verbatim as possible from the question, made it into functions, and then used a tidymodels approach to estimate the model and make predictions on 10,000 bootstrap samples:

library(dplyr)
library(purrr)
library(tidymodels)

set.seed(2023)

df <- data.frame(
  DBH = c(27.05575, 30.10165, 41.36365, 48.31459, 64.64380, 64.88845, 65.55535, 75.12160, 79.40695, 113.27850, 114.68800, 120.68150, 125.24300, 130.27200, 132.17600, 144.75700),
  length = c(0.0000000, 0.0000000, 0.0000000, 0.0000000, 1.5056656, 0.4686661, 1.5143648, 1.2282208, 0.3701741, 19.2412440, 51.3086010, 33.4588765, 254.6009090, 35.0538617, 59.5713370, 195.1270735),
  normalised = c(0.000000000, 0.000000000, 0.000000000, 0.000000000, 0.005913827, 0.001840787, 0.005947995, 0.004824102, 0.001453939, 0.075574137, 0.201525600, 0.131416956, 1.000000000, 0.137681605, 0.233979278, 0.76640368)
)

# Fit the loess on one bootstrap resample (the "analysis" part of the split)
fit_loess_on_bootstrap <- function(split) {
  loess(normalised ~ DBH, data = analysis(split), span = .8)
}

# Predict at 300 evenly spaced points across the x range;
# int_pctl() expects columns named `term` and `estimate`
extract_prediction <- function(model, xrange) {
  xseq <- seq(from = xrange[1], to = xrange[2], length = 300)
  pred <- predict(model, newdata = data.frame(DBH = xseq), se = TRUE)
  tibble(term = xseq, estimate = pred$fit)
}

boots <-
  df %>%
  bootstraps(10000) %>%
  mutate(
    model = map(splits, fit_loess_on_bootstrap),
    preds = map(model, extract_prediction, xrange = range(df$DBH)),
    spline = map(model, augment)
  )

## Look at a sample of individual fitted loess curves:
boots %>%
  sample_n(100) %>%
  unnest(cols = c(spline)) %>%
  ggplot(aes(DBH, normalised)) +
  geom_line(aes(DBH, .fitted, group = id), alpha = .2) +
  geom_point(data = df) +
  theme_bw()


## Estimate CI using the percentile method:
results <-
  boots %>%
  int_pctl(preds)

results %>%
  ggplot(aes(term, .estimate, ymin = .lower, ymax = .upper)) +
  geom_ribbon(fill = "grey75") +
  geom_line() +
  labs(x = "DBH", y = "normalised") +
  theme_bw()

EDIT 23-04-03: Here is an example of how to extract the SD for each point.

We can extract the predictions for each point from each bootstrap sample by unnesting our preds column. (The preds column is just a list of data frames, one for each bootstrap sample, so to extract the predictions you could use any method of row-binding them together, such as boots_preds <- do.call("rbind", boots$preds).)

If we then group by term, which denotes the point in the range, and use summarize(), we can compute whatever summary we want at each point in the range, including the standard deviation, mean, median, and so on. For instance:

## EDIT 23-04-03: Extract mean and sd from the boots:

results_sd <-
  boots %>%
  unnest(cols = c(preds)) %>%
  filter(!is.na(estimate)) %>%
  group_by(term) %>%
  summarize(
    avg = mean(estimate),
    med = median(estimate),
    std_dev = sd(estimate)
  ) %>%
  ungroup()

print(results_sd)

# A tibble: 300 × 4
    term      avg         med std_dev
   <dbl>    <dbl>       <dbl>   <dbl>
 1  27.1 0.00131   0.000207   0.00252
 2  27.4 0.00116   0.000149   0.00231
 3  27.8 0.00101   0.000128   0.00211
 4  28.2 0.000872  0.0000926  0.00193
 5  28.6 0.000732  0.0000519  0.00176
 6  29.0 0.000595  0.0000146  0.00161
 7  29.4 0.000460 -0.00000783 0.00148
 8  29.8 0.000327 -0.0000393  0.00138
 9  30.2 0.000566 -0.0000133  0.00199
10  30.6 0.000423 -0.0000236  0.00187
# … with 290 more rows
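If you also want interval estimates built directly from these per-point summaries, one option (a sketch, not part of the original answer, and only sensible when the bootstrap distribution is roughly symmetric) is a normal approximation from the bootstrap mean and sd:

```r
## Sketch: normal-approximation 95% intervals from the per-point
## bootstrap mean and sd computed in results_sd above
results_norm <-
  results_sd %>%
  mutate(
    lower = avg - 1.96 * std_dev,
    upper = avg + 1.96 * std_dev
  )
```

Comparing these against the int_pctl() percentile intervals is a useful sanity check: the two will differ noticeably where the bootstrap distribution is skewed, as it often is near the edges of the data.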

Upvotes: 0
