Bootstrapping using tidymodels from a list of dataframes in R

Question

I am running a model using tidymodels, where I split the data by group and run a regression on each individual dataframe. This works well. However, I now also need to bootstrap my results, and I'm not sure how to build this into my existing code.

My original code looks something like this:

library(dplyr)

year <- rep(2014:2018, length.out=10000)
group <- sample(c(0,1,2,3,4,5,6), replace=TRUE, size=10000)
value <- sample(10000, replace=TRUE)
female <- sample(c(0,1), replace=TRUE, size=10000)
smoker <- sample(c(0,1), replace=TRUE, size=10000)
dta <- data.frame(year=year, group=group, value=value, female=female, smoker=smoker)

# cut the dataset into list
table_list <- dta %>%
  group_by(year, group) %>%
  group_split()

# fit model per subgroup
model_list <- lapply(table_list, function(x) glm(smoker ~ female, data=x,
                                                 family=binomial(link="probit")))
# predict
pred_list <- lapply(model_list, function(x) predict.glm(x, type = "response"))

I would like to bootstrap with replacement to obtain the bootstrapped predicted values. My gut feeling is that I should split the dataset further by creating random samples when I create the table_list. But how exactly do I do that?

Thanks for your help.

Julia Silge · Accepted Answer

This is fairly complex, with the grouping and the bootstrapping, so I would probably approach it like this, using map() two layers deep:

library(tidyverse)
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#>   method                   from   
#>   required_pkgs.model_spec parsnip

year <- rep(2014:2018, length.out=10000)
group <- sample(c(0,1,2,3,4,5,6), replace=TRUE, size=10000)
value <- sample(10000, replace=TRUE)
female <- sample(c(0,1), replace=TRUE, size=10000)
smoker <- sample(c(0,1), replace=TRUE, size=10000)
dta <- tibble(year=year, group=group, value=value, female=female, smoker=smoker)


glm_boot_mods <- 
  dta %>%
  nest(data = c(-year, -group)) %>%
  mutate(boots = map(
    data,  
    ~ bootstraps(., times = 20) %>%
      mutate(model = map(.$splits, ~ glm(smoker ~ female, data = analysis(.x),
                                         family = binomial(link = "probit"))),
             preds = map2(model, .$splits, ~predict(.x, newdata = assessment(.y))))
    ))


glm_boot_mods
#> # A tibble: 35 × 4
#>     year group data               boots                
#>    <int> <dbl> <list>             <list>
#>  1  2014     1  
#>  2  2015     4  
#>  3  2016     3  
#>  4  2017     2  
#>  5  2018     0  
#>  6  2014     3  
#>  7  2016     2  
#>  8  2018     1  
#>  9  2014     0  
#> 10  2015     6  
#> # … with 25 more rows

The first map() creates the bootstrap resamples for each year/group combination. The second layer then fits a model to each resample and predicts for that resample's held-out observations. Here is what that looks like for the first group:

glm_boot_mods %>%
  head(1) %>% 
  pull(boots)
#> [[1]]
#> # Bootstrap sampling 
#> # A tibble: 20 × 4
#>    splits            id          model  preds      
#>    <list>            <chr>       <list> <list>
#>  1  Bootstrap01   
#>  2   Bootstrap02    
#>  3  Bootstrap03   
#>  4  Bootstrap04   
#>  5  Bootstrap05   
#>  6  Bootstrap06   
#>  7   Bootstrap07    
#>  8  Bootstrap08   
#>  9   Bootstrap09    
#> 10  Bootstrap10   
#> 11  Bootstrap11   
#> 12  Bootstrap12   
#> 13  Bootstrap13   
#> 14  Bootstrap14   
#> 15  Bootstrap15   
#> 16  Bootstrap16   
#> 17  Bootstrap17   
#> 18  Bootstrap18   
#> 19  Bootstrap19   
#> 20  Bootstrap20

Created on 2021-11-02 by the reprex package (v2.0.1)
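To make the inner layer easier to follow, here is a minimal sketch of what happens for a single resample of a single group. The `one_group` data frame, the seed, and the object names are assumptions made up for illustration; they are not part of the code above. The sketch also passes `type = "response"`, as the question does, because `predict()` on a glm returns values on the link scale by default.

```r
library(rsample)

set.seed(123)  # hypothetical seed, only so the sketch is reproducible

# Hypothetical stand-in for one year/group subset of dta
one_group <- data.frame(
  female = sample(c(0, 1), size = 300, replace = TRUE),
  smoker = sample(c(0, 1), size = 300, replace = TRUE)
)

# Inner layer only: bootstrap resamples of a single group's data
boots <- bootstraps(one_group, times = 20)
first_split <- boots$splits[[1]]

# analysis() returns the rows drawn with replacement (same size as the input);
# assessment() returns the out-of-bag rows that were never drawn
fit <- glm(smoker ~ female, data = analysis(first_split),
           family = binomial(link = "probit"))

# type = "response" gives predicted probabilities instead of link-scale values
oob_pred <- predict(fit, newdata = assessment(first_split), type = "response")
```

Here `oob_pred` holds one predicted probability per held-out row of the first resample. The nested map() calls repeat this for every resample in every group.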

Notice that each resample has predictions for its held-out observations. Depending on what you want to do next, you can use unnest() on whichever columns of glm_boot_mods you need.
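As one possible way to carry out that unnest() step, the sketch below summarises the bootstrapped predictions into rough percentile intervals per year, group, and value of female. It is an illustration under assumptions, not the only reasonable summary: the helper `fit_boots()`, the smaller simulated dataset, the seed, and the `.pred`, `lower`, `estimate`, and `upper` column names are all hypothetical. It requests `type = "response"` so the predictions are probabilities. With only 20 resamples the interval bounds will be noisy, so in practice you would likely want a much larger `times`.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(rsample)

set.seed(123)  # hypothetical seed for reproducibility
n <- 2000      # smaller than the original 10000 to keep the sketch fast
dta <- tibble(
  year   = rep(2014:2015, length.out = n),
  group  = sample(0:1, size = n, replace = TRUE),
  female = sample(c(0, 1), size = n, replace = TRUE),
  smoker = sample(c(0, 1), size = n, replace = TRUE)
)

# Hypothetical helper: for one group's data, fit a probit model on each
# bootstrap resample and predict for that resample's held-out rows
fit_boots <- function(df, times = 20) {
  boots <- bootstraps(df, times = times)
  tibble(
    id = boots$id,
    preds = map(boots$splits, function(s) {
      fit <- glm(smoker ~ female, data = analysis(s),
                 family = binomial(link = "probit"))
      held_out <- assessment(s)
      tibble(
        female = held_out$female,
        .pred  = predict(fit, newdata = held_out, type = "response")
      )
    })
  )
}

# Outer layer: one set of resamples per year/group, then flatten
boot_preds <- dta %>%
  nest(data = c(-year, -group)) %>%
  mutate(boots = map(data, fit_boots)) %>%
  select(-data) %>%
  unnest(boots) %>%
  unnest(preds)

# Collapse to one prediction per resample and value of female,
# then take percentiles across resamples
boot_summary <- boot_preds %>%
  group_by(year, group, female, id) %>%
  summarise(.pred = mean(.pred), .groups = "drop") %>%
  group_by(year, group, female) %>%
  summarise(
    lower    = quantile(.pred, 0.025, names = FALSE),
    estimate = median(.pred),
    upper    = quantile(.pred, 0.975, names = FALSE),
    .groups  = "drop"
  )
```

Because the model has a single binary predictor, each resample produces only two distinct predicted probabilities, which is why the sketch collapses to one value per resample before taking percentiles.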
