Reputation: 564
I am running a model using tidymodels
, where split the data by group and run regressions on each individual dataframe. This works well. However, now I also need to bootstrap my results. I'm not sure how to build this into my existing code.
My original code looks something like this:
library(dplyr)
year <- rep(2014:2018, length.out=10000)
group <- sample(c(0,1,2,3,4,5,6), replace=TRUE, size=10000)
value <- sample(10000, replace=T)
female <- sample(c(0,1), replace=TRUE, size=10000)
smoker <- sample(c(0,1), replace=TRUE, size=10000)
dta <- data.frame(year=year, group=group, value=value, female=female, smoker=smoker)
# cut the dataset into list
table_list <- dta %>%
group_by(year, group) %>%
group_split()
# fit model per subgroup
model_list <- lapply(table_list, function(x) glm(smoker ~ female, data=x,
family=binomial(link="probit")))
# predict
pred_list <- lapply(model_list, function(x) predict.glm(x, type = "response"))
I would like to bootstrap with replacement to obtain the bootstrapped predicted values. My gut feeling is that I should split the dataset further by creating random samples when I create the table_list. But how exactly do I do that?
Thanks for your help.
Upvotes: 0
Views: 370
Reputation: 11623
This is fairly complex, with the grouping and the bootstrapping, so I would probably approach it like this, using map()
two layers deep:
library(tidyverse)
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#> method from
#> required_pkgs.model_spec parsnip
year <- rep(2014:2018, length.out=10000)
group <- sample(c(0,1,2,3,4,5,6), replace=TRUE, size=10000)
value <- sample(10000, replace=T)
female <- sample(c(0,1), replace=TRUE, size=10000)
smoker <- sample(c(0,1), replace=TRUE, size=10000)
dta <- tibble(year=year, group=group, value=value, female=female, smoker=smoker)
glm_boot_mods <-
dta %>%
nest(data = c(-year, -group)) %>%
mutate(boots = map(
data,
~ bootstraps(., times = 20) %>%
mutate(model = map(.$splits, ~ glm(smoker ~ female, data = analysis(.x),
family = binomial(link = "probit"))),
preds = map2(model, .$splits, ~predict(.x, newdata = assessment(.y))))
))
glm_boot_mods
#> # A tibble: 35 × 4
#> year group data boots
#> <int> <dbl> <list> <list>
#> 1 2014 1 <tibble [288 × 3]> <bootstraps [20 × 4]>
#> 2 2015 4 <tibble [273 × 3]> <bootstraps [20 × 4]>
#> 3 2016 3 <tibble [301 × 3]> <bootstraps [20 × 4]>
#> 4 2017 2 <tibble [282 × 3]> <bootstraps [20 × 4]>
#> 5 2018 0 <tibble [276 × 3]> <bootstraps [20 × 4]>
#> 6 2014 3 <tibble [279 × 3]> <bootstraps [20 × 4]>
#> 7 2016 2 <tibble [314 × 3]> <bootstraps [20 × 4]>
#> 8 2018 1 <tibble [296 × 3]> <bootstraps [20 × 4]>
#> 9 2014 0 <tibble [304 × 3]> <bootstraps [20 × 4]>
#> 10 2015 6 <tibble [288 × 3]> <bootstraps [20 × 4]>
#> # … with 25 more rows
The first map()
creates the bootstrap resamples for each grouping, and then we go one layer deeper and for each resample fit a model and predict for the heldout observations for that resample. You can see what that looks like inside here for the first group:
glm_boot_mods %>%
head(1) %>%
pull(boots)
#> [[1]]
#> # Bootstrap sampling
#> # A tibble: 20 × 4
#> splits id model preds
#> <list> <chr> <list> <list>
#> 1 <split [288/111]> Bootstrap01 <glm> <dbl [111]>
#> 2 <split [288/93]> Bootstrap02 <glm> <dbl [93]>
#> 3 <split [288/103]> Bootstrap03 <glm> <dbl [103]>
#> 4 <split [288/106]> Bootstrap04 <glm> <dbl [106]>
#> 5 <split [288/109]> Bootstrap05 <glm> <dbl [109]>
#> 6 <split [288/109]> Bootstrap06 <glm> <dbl [109]>
#> 7 <split [288/92]> Bootstrap07 <glm> <dbl [92]>
#> 8 <split [288/111]> Bootstrap08 <glm> <dbl [111]>
#> 9 <split [288/99]> Bootstrap09 <glm> <dbl [99]>
#> 10 <split [288/111]> Bootstrap10 <glm> <dbl [111]>
#> 11 <split [288/102]> Bootstrap11 <glm> <dbl [102]>
#> 12 <split [288/104]> Bootstrap12 <glm> <dbl [104]>
#> 13 <split [288/115]> Bootstrap13 <glm> <dbl [115]>
#> 14 <split [288/111]> Bootstrap14 <glm> <dbl [111]>
#> 15 <split [288/108]> Bootstrap15 <glm> <dbl [108]>
#> 16 <split [288/110]> Bootstrap16 <glm> <dbl [110]>
#> 17 <split [288/110]> Bootstrap17 <glm> <dbl [110]>
#> 18 <split [288/111]> Bootstrap18 <glm> <dbl [111]>
#> 19 <split [288/103]> Bootstrap19 <glm> <dbl [103]>
#> 20 <split [288/109]> Bootstrap20 <glm> <dbl [109]>
Created on 2021-11-02 by the reprex package (v2.0.1)
Notice that there are predictions for the heldout observations for each resample. Depending on what you want to do, you can use unnest()
on the columns of glm_boot_mods
you need to handle next.
Upvotes: 1