Stata_user
Stata_user

Reputation: 564

Bootstrapping using tidymodels from a list of dataframes in R

I am running a model using tidymodels, where split the data by group and run regressions on each individual dataframe. This works well. However, now I also need to bootstrap my results. I'm not sure how to build this into my existing code.

My original code looks something like this:

library(dplyr)

year <- rep(2014:2018, length.out=10000)
group <- sample(c(0,1,2,3,4,5,6), replace=TRUE, size=10000)
value <- sample(10000, replace=T)
female <- sample(c(0,1), replace=TRUE, size=10000)
smoker <- sample(c(0,1), replace=TRUE, size=10000)
dta <- data.frame(year=year, group=group, value=value, female=female, smoker=smoker)

# cut the dataset into list
table_list <- dta %>%
  group_by(year, group) %>%
  group_split()

# fit model per subgroup
model_list <- lapply(table_list, function(x) glm(smoker ~ female, data=x,
                                                 family=binomial(link="probit")))
# predict
pred_list <- lapply(model_list, function(x) predict.glm(x, type = "response"))

I would like to bootstrap with replacement to obtain the bootstrapped predicted values. My gut feeling is that I should split the dataset further by creating random samples when I create the table_list. But how exactly do I do that?

Thanks for your help.

Upvotes: 0

Views: 370

Answers (1)

Julia Silge
Julia Silge

Reputation: 11623

This is fairly complex, with the grouping and the bootstrapping, so I would probably approach it like this, using map() two layers deep:

library(tidyverse)
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#>   method                   from   
#>   required_pkgs.model_spec parsnip

year <- rep(2014:2018, length.out=10000)
group <- sample(c(0,1,2,3,4,5,6), replace=TRUE, size=10000)
value <- sample(10000, replace=T)
female <- sample(c(0,1), replace=TRUE, size=10000)
smoker <- sample(c(0,1), replace=TRUE, size=10000)
dta <- tibble(year=year, group=group, value=value, female=female, smoker=smoker)


glm_boot_mods <- 
  dta %>%
  nest(data = c(-year, -group)) %>%
  mutate(boots = map(
    data,  
    ~ bootstraps(., times = 20) %>%
      mutate(model = map(.$splits, ~ glm(smoker ~ female, data = analysis(.x),
                                         family = binomial(link = "probit"))),
             preds = map2(model, .$splits, ~predict(.x, newdata = assessment(.y))))
    ))


glm_boot_mods
#> # A tibble: 35 × 4
#>     year group data               boots                
#>    <int> <dbl> <list>             <list>               
#>  1  2014     1 <tibble [288 × 3]> <bootstraps [20 × 4]>
#>  2  2015     4 <tibble [273 × 3]> <bootstraps [20 × 4]>
#>  3  2016     3 <tibble [301 × 3]> <bootstraps [20 × 4]>
#>  4  2017     2 <tibble [282 × 3]> <bootstraps [20 × 4]>
#>  5  2018     0 <tibble [276 × 3]> <bootstraps [20 × 4]>
#>  6  2014     3 <tibble [279 × 3]> <bootstraps [20 × 4]>
#>  7  2016     2 <tibble [314 × 3]> <bootstraps [20 × 4]>
#>  8  2018     1 <tibble [296 × 3]> <bootstraps [20 × 4]>
#>  9  2014     0 <tibble [304 × 3]> <bootstraps [20 × 4]>
#> 10  2015     6 <tibble [288 × 3]> <bootstraps [20 × 4]>
#> # … with 25 more rows

The first map() creates the bootstrap resamples for each grouping, and then we go one layer deeper and for each resample fit a model and predict for the heldout observations for that resample. You can see what that looks like inside here for the first group:

glm_boot_mods %>%
  head(1) %>% 
  pull(boots)
#> [[1]]
#> # Bootstrap sampling 
#> # A tibble: 20 × 4
#>    splits            id          model  preds      
#>    <list>            <chr>       <list> <list>     
#>  1 <split [288/111]> Bootstrap01 <glm>  <dbl [111]>
#>  2 <split [288/93]>  Bootstrap02 <glm>  <dbl [93]> 
#>  3 <split [288/103]> Bootstrap03 <glm>  <dbl [103]>
#>  4 <split [288/106]> Bootstrap04 <glm>  <dbl [106]>
#>  5 <split [288/109]> Bootstrap05 <glm>  <dbl [109]>
#>  6 <split [288/109]> Bootstrap06 <glm>  <dbl [109]>
#>  7 <split [288/92]>  Bootstrap07 <glm>  <dbl [92]> 
#>  8 <split [288/111]> Bootstrap08 <glm>  <dbl [111]>
#>  9 <split [288/99]>  Bootstrap09 <glm>  <dbl [99]> 
#> 10 <split [288/111]> Bootstrap10 <glm>  <dbl [111]>
#> 11 <split [288/102]> Bootstrap11 <glm>  <dbl [102]>
#> 12 <split [288/104]> Bootstrap12 <glm>  <dbl [104]>
#> 13 <split [288/115]> Bootstrap13 <glm>  <dbl [115]>
#> 14 <split [288/111]> Bootstrap14 <glm>  <dbl [111]>
#> 15 <split [288/108]> Bootstrap15 <glm>  <dbl [108]>
#> 16 <split [288/110]> Bootstrap16 <glm>  <dbl [110]>
#> 17 <split [288/110]> Bootstrap17 <glm>  <dbl [110]>
#> 18 <split [288/111]> Bootstrap18 <glm>  <dbl [111]>
#> 19 <split [288/103]> Bootstrap19 <glm>  <dbl [103]>
#> 20 <split [288/109]> Bootstrap20 <glm>  <dbl [109]>

Created on 2021-11-02 by the reprex package (v2.0.1)

Notice that there are predictions for the heldout observations for each resample. Depending on what you want to do, you can use unnest() on the columns of glm_boot_mods you need to handle next.

Upvotes: 1

Related Questions