Dudelstein

Reputation: 674

Bootstrapping in R - each sample comprising multiple rows

I am bootstrapping an example data frame, pay, using base R. The main difference from classical bootstrapping is that a sampled unit (an ID) can span multiple rows, all of which must be included.

There are 4 IDs in pay, so my goal is to draw a sample of 4 IDs with replacement and create a new dataset, resample, containing all rows of the sampled IDs.

My code works, but it is too slow for my real data, which has one million rows, and for the many repetitions that bootstrapping requires.

Creating pay:

ID    <- c(1, 1, 1, 2, 3, 3, 4, 4, 4, 4)
level <- 1:10
pay   <- data.frame(ID = ID, level = level)

My (inefficient) code for creating a single resampled dataset:

IDs <- levels(as.factor(ID))
samp <- sample(IDs, length(IDs), replace = TRUE)
resample <- numeric(0)

for (i in seq_along(IDs)) {
  temp <- pay[pay$ID == samp[i], ]
  resample <- rbind(resample, temp)
}

Result:

 samp
[1] "1" "2" "3" "1"


 resample
  ID level
1  1   0.5
2  1  -2.0
3  1   3.0
4  2   4.0
5  3   5.0
6  3   6.0
7  1   0.5
8  1  -2.0
9  1   3.0

I think the slowest part is extending resample with every iteration. However, I do not know how many rows there will be at the end. Thanks a lot for your help.

Upvotes: 0

Views: 1826

Answers (2)

Dan W

Reputation: 121

I recently had to do this myself with a large data frame, and I found @Josh's code to be inefficient to the point of being completely impractical for use in a bootstrap.

Instead, I wrote the following code, which seems to reduce the amount of computing time to a trivial amount:

# Draw a sample of IDs from the data frame
# Length of sample is equal to the number of unique IDs in your data frame
samp <- sample(unique(df$id), length(unique(df$id)), replace=TRUE)

# Create a data frame tracking number of occurrences of IDs in the sample  
df_table <- as.data.frame(table(samp))
df_table$samp <- as.numeric(levels(df_table$samp))[df_table$samp]

# Initialize some variables for the loop that creates the bootstrap data frame  
a <- 1
df_boot <- data.frame()

while (a <= max(df_table$Freq)) {
  # IDs drawn at least `a` times contribute one more copy of their rows
  id_boot <- df_table[df_table$Freq >= a, 1]
  df_boot <- rbind(df_boot, df[df$id %in% id_boot, ])
  a <- a + 1
}

The trick here is that the number of passes over the data frame depends on how often the most frequently drawn ID appears in the sample, not on how many IDs were sampled. @Josh's code runs a separate comparison, pay$ID == x, against every row for each ID in the sample. If your data frame had 5,000 unique IDs and 10,000 rows of data, that amounts to 5,000 x 10,000 = 50,000,000 comparisons per resample, plus one rbind piece per ID. That takes a significant amount of time, which is impractical for bootstrapping, since you generally need to repeat the code thousands of times.

Instead, df[df$id %in% id_boot, ] pulls the rows for every ID drawn at least a times in one vectorized pass. %in% still checks the whole id column, but the loop runs only max(df_table$Freq) times, which is usually a small number, so far less work is spent on rows that do not match.

I am able to run this code on a data frame with 10,000 rows of data and complete the operation in about 1-2 seconds, whereas @Josh's code took nearly a minute to complete.
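
A loop-free variant of the same frequency-count idea is sketched below. It is an illustration under stated assumptions rather than the code timed above: boot_by_count and its id_col argument are hypothetical names, and the small df mirrors the example data from the question. Each row index is repeated as many times as its ID was drawn, so no rbind is needed.

```r
# Hypothetical helper (illustrative name): cluster bootstrap via row repetition.
# Assumes `df` has an ID column whose name is given by `id_col`.
boot_by_count <- function(df, id_col = "id") {
  ids <- unique(df[[id_col]])
  # Sample positions rather than values, so a single numeric ID is never
  # misread by sample() as the range 1:ID
  draw <- sample.int(length(ids), replace = TRUE)
  # Number of times each unique ID was drawn (zero if never drawn)
  freq <- tabulate(draw, nbins = length(ids))
  # Look up each row's draw count, then repeat the row index that many times
  times <- freq[match(df[[id_col]], ids)]
  df[rep(seq_len(nrow(df)), times = times), , drop = FALSE]
}

set.seed(1)
df <- data.frame(id = c(1, 1, 1, 2, 3, 3, 4, 4, 4, 4), level = 1:10)
df_boot <- boot_by_count(df)
```

Note that the rows come back in their original order, with repeated copies next to each other, rather than in the order the IDs were drawn. That should not matter for statistics computed on the whole resample.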

Upvotes: 0

Josh

Reputation: 1278

You can sample the rows by doing

pay[sample(seq_len(nrow(pay)), replace=TRUE),]

It seems fairly efficient.

> system.time({
+   for (i in 1:10000)
+     pay[sample(seq_len(nrow(pay)), replace=TRUE),]
+ })
   user  system elapsed
  0.469   0.002   0.473

Edit:

Per Dudelstein's comment below, the above is incorrect: it resamples individual rows rather than whole IDs. Here's a way to address what I think you're asking for.

samp <- sample(unique(ID), replace=TRUE)
do.call(rbind, lapply(samp, function(x) pay[pay$ID == x,]))

In my benchmarks, it is roughly a third faster than the original method. I'm sure there's a better way.
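
One candidate for a better way, offered as a minimal sketch and not benchmarked here, is to look up the row positions of each ID once with split() and then build every resample by plain integer indexing. The names rows_by_id and one_resample are hypothetical, and pay is rebuilt from the question so the snippet runs on its own.

```r
# Example data from the question
ID  <- c(1, 1, 1, 2, 3, 3, 4, 4, 4, 4)
pay <- data.frame(ID = ID, level = 1:10)

# Row positions of each ID, computed once and reused for every replicate
rows_by_id <- split(seq_len(nrow(pay)), pay$ID)

# Hypothetical helper: draw IDs with replacement and return all of their rows
one_resample <- function(data, rows_by_id) {
  # Sample list positions, one per unique ID, with replacement
  draw <- sample.int(length(rows_by_id), replace = TRUE)
  # Concatenate the row positions of the drawn IDs, in draw order
  rows <- unlist(rows_by_id[draw], use.names = FALSE)
  data[rows, , drop = FALSE]
}

set.seed(42)
resample <- one_resample(pay, rows_by_id)
```

Because rows_by_id does not change between replicates, only the sample.int() call and a single subsetting step are repeated inside the bootstrap loop, and no data frame is grown with rbind.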

Upvotes: 2
