deschen

Reputation: 10996

R: bootstrap resampling with multiple observations per id, returning the resampled data as the result

I'm trying to do a bootstrap for my data. My data (df) has the following shape.

id    v1    v2
1    1    1
1    0    1
1    0    1
2    2    0
2    1    1
2    0    0

As far as I understand, when setting up a bootstrap in R, the resampling (with replacement) is done at the row level, right?

So I would set up something like:

boot_function <- function(data, i) {
  boot_data <- data[i, ]
}

However, my first question is: how would I set this up when I have several observations per id that need to be kept together in the bootstrap? In my example I can't simply sample among rows; I need to sample among ids. So instead of the above, I used this one:

boot_function2 <- function(data, i) {
  boot_data <- data[data$id %in% i, ]
}

Would that be the correct way?

Related to the above, I wanted to check whether my approach is right by looking at what the resamples look like, but I have no idea how to return the individual bootstrap sample data frames. Any ideas? (I know that if my original data is large and I run something like 2000 replicates, the returned object could be quite large, so I'd probably just spot-check this with R = 10 or so.)

Upvotes: 2

Views: 1886

Answers (2)

alan ocallaghan

Reputation: 3038

I think bootstrapping by sample ID is absolutely fine. Here's an example using the boot package. I'm not sure if I understood exactly what you're bootstrapping so the function may not be exactly right, but you should be able to understand more or less what it's doing. It's not very efficient; I haven't optimised it at all given that I'm not sure about the statistic.

library("boot")
# 3 ids x 100 repeats = 300 rows, matching the 300 values below
ids <- rep(1:3, times = 100)
values <- rnorm(300)

dat <- data.frame(ids, values)

boot_fun <- function(ids, i) {
  sapply(ids[i], function(j) mean(dat[dat$ids == j, "values"]))
}


boot_res <- boot(
  dat$ids,
  statistic = boot_fun,
  R = 100
)
hist(boot_res$t)

Created on 2019-11-08 by the reprex package (v0.3.0)
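
To tie this back to the data in the question, here is a minimal sketch of resampling whole ids rather than rows. It assumes the statistic of interest is simply the mean of v1, which is only a placeholder; `cluster_stat` and its `full_data` argument are names made up for this illustration. The idea is to hand `boot()` the vector of unique ids, so the index `i` picks ids, and then to stack the rows of every drawn id. Note that subsetting with `%in%` only tests membership, so an id drawn twice would still contribute its rows only once.

```r
library(boot)

# The example data from the question
df <- data.frame(
  id = c(1, 1, 1, 2, 2, 2),
  v1 = c(1, 0, 0, 2, 1, 0),
  v2 = c(1, 1, 1, 0, 1, 0)
)

# %in% tests membership only: drawing id 1 twice still yields 3 rows, not 6
nrow(df[df$id %in% c(1, 1), ])

# Hypothetical statistic (an assumption): mean of v1 in the resampled data.
# boot() resamples `unique_ids`; `i` indexes into that vector.
cluster_stat <- function(unique_ids, i, full_data) {
  sampled_ids <- unique_ids[i]
  # Stack the rows of each drawn id; an id drawn twice appears twice
  boot_data <- do.call(
    rbind,
    lapply(sampled_ids, function(j) full_data[full_data$id == j, ])
  )
  mean(boot_data$v1)
}

set.seed(1)
boot_res <- boot(
  unique(df$id),
  statistic = cluster_stat,
  R = 10,
  full_data = df  # passed through boot()'s ... to cluster_stat
)
boot_res$t
```

The extra argument is deliberately not called `data`, because that name would be matched to the first argument of `boot()` itself.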

Upvotes: 1

MDEWITT

Reputation: 2368

Here is an approach. I will first generate some fake data:

ids <- rep(1:3, times = 10)
values <- rnorm(30)

dat <- data.frame(ids, values)

Now that we have data, we can generate the cluster bootstrapping function. This will sample from within each cluster and return a new dataframe. Then you can apply your test statistic:

library(tidyverse)

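cluster_boot_function <- function(x) {

  # Resample 5 values (with replacement) within each id
  clustered_boot <- dat %>%
    group_by(ids) %>%
    nest() %>%
    mutate(samps = map(data, ~ sample(.$values, size = 5, replace = TRUE))) %>%
    select(ids, samps) %>%
    unnest(cols = samps)

  # Summarise the resampled column `samps`; `values` no longer exists here,
  # so mean(values) would silently pick up the global `values` vector
  results <- clustered_boot %>%
    group_by(ids) %>%
    summarise(mu = mean(samps))

  results
}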

Now you just need to apply it repeatedly (also note that the "x" in the function doesn't do anything, I just need it there for the next step).

Here I use map_df to return the summary statistics for each iteration as a single data frame:

out <- map_df(1:100, cluster_boot_function, .id = "iteration")

And this will give you your statistics for each iteration of the bootstrap, one row per iteration and id (the mu values will vary from run to run):

# A tibble: 300 x 3
   iteration   ids    mu
   <chr>     <int> <dbl>
 1 1             1 0.150
 2 1             2 0.150
 3 1             3 0.150
 4 2             1 0.150
 5 2             2 0.150
 6 2             3 0.150
 7 3             1 0.150
 8 3             2 0.150

From this you could extend it to whatever kind of modeling you need to do.
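
If the goal is to look at the resampled data frames themselves rather than summary statistics, something along these lines might work for a spot check. It is a sketch in base R on the data from the question, resampling whole ids with replacement; `resample_by_id` and the added `draw` column are hypothetical names introduced here, not part of any package.

```r
# The example data from the question
df <- data.frame(
  id = c(1, 1, 1, 2, 2, 2),
  v1 = c(1, 0, 0, 2, 1, 0),
  v2 = c(1, 1, 1, 0, 1, 0)
)

# Hypothetical helper: draw ids with replacement and stack their rows
resample_by_id <- function(data) {
  ids <- unique(data$id)
  # sample.int avoids sample()'s special case for a single numeric id
  drawn <- ids[sample.int(length(ids), replace = TRUE)]
  pieces <- lapply(seq_along(drawn), function(k) {
    piece <- data[data$id == drawn[k], ]
    piece$draw <- k  # label each draw so a repeated id stays distinguishable
    piece
  })
  do.call(rbind, pieces)
}

set.seed(42)
# Keep a small number of resamples (10 here) as a list of data frames
resamples <- replicate(10, resample_by_id(df), simplify = FALSE)
resamples[[1]]
```

Each element of `resamples` is one bootstrap data frame, so `resamples[[3]]` shows the third one; with many replicates or large data the list grows quickly, so a handful is probably enough for checking.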

Upvotes: 1
