Reputation: 10996
I'm trying to do a bootstrap for my data. My data (df) has the following shape.
id v1 v2
1 1 1
1 0 1
1 0 1
2 2 0
2 1 1
2 0 0
As far as I understand, when initializing the bootstrap in R, the resampling (with replacement) is done at the row level, right?
So setting up something like:
boot_function <- function(data, i)
{boot_data <- data[i,]}
However, my first question is, how would I set this up in a scenario where I have several observations per id that need to be kept together in the bootstrap? In my example, I can't simply sample among rows; I need to sample among ids. So instead of the above, I used this one:
boot_function2 <- function(data, i)
{boot_data <- data[data$id %in% i,]}
Would that be the correct way?
Related to the above scenario, I wanted to check whether my approach is right, so I thought I would look at the resamples themselves, but I have no idea how to return the individual bootstrap sample data frames. Any ideas? (I know that if my original data is large and I'm doing something like 2000 replicates, the returned object could be quite large, so I'll probably just spot-check this with R = 10 or so.)
Upvotes: 2
Views: 1886
Reputation: 3038
I think bootstrapping by sample ID is absolutely fine. Here's an example using the boot package. I'm not sure if I understood exactly what you're bootstrapping, so the function may not be exactly right, but you should be able to understand more or less what it's doing. It's not very efficient; I haven't optimised it at all given that I'm not sure about the statistic.
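library("boot")
# 3 ids with 100 observations each (300 rows in total)
ids <- rep(1:3, times = 100)
values <- rnorm(300)
dat <- data.frame(ids, values)
# boot() resamples positions i of the id vector; for each resampled id,
# return the mean of that id's values
boot_fun <- function(ids, i) {
  sapply(ids[i], function(j) mean(dat[dat$ids == j, "values"]))
}
boot_res <- boot(
  dat$ids,
  statistic = boot_fun,
  R = 100
)
hist(boot_res$t)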
Created on 2019-11-08 by the reprex package (v0.3.0)
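If the goal is to keep all rows of an id together, one possible variant is to hand boot() the vector of unique ids and rebuild the data frame inside the statistic. This is only a sketch under assumptions: the column names id and value, the helper name cluster_stat, and the choice of the overall mean as the statistic are all hypothetical placeholders. Note that the rows are collected one sampled id at a time, so an id drawn twice contributes its rows twice, which a filter like data$id %in% i cannot do.

```r
library(boot)

set.seed(1)
# Hypothetical toy data: 3 ids with 10 observations each
dat <- data.frame(id = rep(1:3, each = 10), value = rnorm(30))
ids <- unique(dat$id)

# boot() resamples positions i of the unique-id vector (with replacement)
cluster_stat <- function(ids, i) {
  # collect all rows of every sampled id; repeated ids are kept as repeats
  resampled <- do.call(rbind, lapply(ids[i], function(j) dat[dat$id == j, ]))
  # placeholder statistic: overall mean of the resampled values
  mean(resampled$value)
}

boot_res <- boot(ids, statistic = cluster_stat, R = 200)
```

boot_res$t then holds one value of the statistic per replicate, and boot_res$t0 is the statistic on the original data.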
Upvotes: 1
Reputation: 2368
Here is an approach. I will first generate some fake data:
ids <- rep(1:3, times = 10)
values <- rnorm(30)
dat <- data.frame(ids, values)
Now that we have data, we can generate the cluster bootstrapping function. This will sample from within each cluster and return a new dataframe. Then you can apply your test statistic:
library(tidyverse)
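cluster_boot_function <- function(x) {
  # resample 5 values with replacement within each id
  clustered_boot <- dat %>%
    group_by(ids) %>%
    nest() %>%
    mutate(samps = map(data, ~ sample(.$values, size = 5, replace = TRUE))) %>%
    select(ids, samps) %>%
    unnest(cols = samps)
  # average the resampled column; after unnest() it is called samps, and
  # mean(values) would silently use the global values vector instead
  results <- clustered_boot %>%
    group_by(ids) %>%
    summarise(mu = mean(samps))
  results
}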
Now you just need to apply it repeatedly (also note that the "x" in the function doesn't do anything, I just need it there for the next step).
Here I use map_dfr to return my summary statistics for each iteration:
out <- map_dfr(1:100, cluster_boot_function, .id = "iteration")
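And this will give you one row per id for each iteration of the bootstrap. Note that every mu in the printout below is identical, which is what happens when summarise() averages the global values vector; with mean(samps), mu varies across ids and iterations:
# A tibble: 300 x 3
  iteration   ids    mu
  <chr>     <int> <dbl>
1 1             1 0.150
2 1             2 0.150
3 1             3 0.150
4 2             1 0.150
5 2             2 0.150
6 2             3 0.150
7 3             1 0.150
8 3             2 0.150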
From this you could extend it to whatever kind of modeling you need to do.
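To spot-check what the resamples look like when whole ids are drawn (rather than values within an id), something along these lines might help. It is a base R sketch, not the only way to do it: resample_ids and the extra draw column are hypothetical names I made up, and df is just the small example from the question. Each replicate draws ids with replacement and stacks all rows of each drawn id, so the resampled data frames themselves are returned for inspection.

```r
set.seed(42)
# The example data from the question
df <- data.frame(
  id = c(1, 1, 1, 2, 2, 2),
  v1 = c(1, 0, 0, 2, 1, 0),
  v2 = c(1, 1, 1, 0, 1, 0)
)

# Hypothetical helper: draw ids with replacement, keep each id's rows together
resample_ids <- function(data, id_col = "id") {
  ids <- unique(data[[id_col]])
  drawn <- ids[sample.int(length(ids), replace = TRUE)]
  pieces <- lapply(seq_along(drawn), function(k) {
    block <- data[data[[id_col]] == drawn[k], , drop = FALSE]
    block$draw <- k  # marks which draw the rows came from
    block
  })
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

# Keep the resampled data frames themselves, e.g. 10 of them for a spot check
boot_samples <- lapply(1:10, function(r) resample_ids(df))
```

Printing boot_samples[[1]] shows one resample; the draw column makes it easy to see when the same id was picked more than once.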
Upvotes: 1