Reputation: 353
I have the same question as Resample with replacement by cluster, i.e. I want to do cluster bootstrapping. The best answer's approach to that question using rbindlist(lapply(resampled_ids, function(resampled_id) df[df$id == resampled_id,]))
works, but because I have a big dataset, this resampling step is rather slow. My question is, is it possible to speed this up?
Upvotes: 1
Views: 285
Reputation: 17011
Use sequence
to index. Demonstrated with a larger data.frame
:
df <- data.frame(id = rep.int(1:1e2, sample(100:200, 1e2, replace = TRUE))[1:1e4], X = rnorm(1e4))
resampled_ids <- sample(unique(df$id), replace = TRUE)
idx <- sequence(tabulate(df$id)[resampled_ids], match(unique(df$id), df$id)[resampled_ids])
s <- data.frame(id = df$id[idx], X = df$X[idx])
Benchmarking against the rbindlist
solution:
library(data.table)
library(microbenchmark)
microbenchmark(rbindlist = rbindlist(lapply(resampled_ids, function(x) df[df$id %in% x,])),
sequence = {idx <- sequence(tabulate(df$id)[resampled_ids], match(unique(df$id), df$id)[resampled_ids])
data.frame(id = df$id[idx], X = df$X[idx])})
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> rbindlist 9480.4 9921.95 11470.567 10431.05 12555.35 31178.2 100
#> sequence 406.7 444.55 564.873 498.10 545.70 2818.4 100
Note that creating a new data.frame
from indexed vectors is much faster than row-indexing the original data.frame
. The difference is much less pronounced if a data.table
is used, but, surprisingly, the rbindlist
solution becomes even slower:
microbenchmark(rbindlist = rbindlist(lapply(resampled_ids, function(x) df[df$id %in% x,])),
sequence1 = df[sequence(tabulate(df$id)[resampled_ids], match(unique(df$id), df$id)[resampled_ids]),],
sequence2 = {idx <- sequence(tabulate(df$id)[resampled_ids], match(unique(df$id), df$id)[resampled_ids])
data.frame(id = df$id[idx], X = df$X[idx])})
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> rbindlist 9431.9 9957.7 11101.545 10508.15 12395.25 15363.3 100
#> sequence1 4284.5 4550.3 4866.891 4674.80 5009.90 8350.1 100
#> sequence2 414.1 455.6 541.590 508.40 551.40 2881.1 100
setDT(df)
microbenchmark(rbindlist = rbindlist(lapply(resampled_ids, function(x) df[df$id %in% x,])),
sequence1 = df[sequence(tabulate(df$id)[resampled_ids], match(unique(df$id), df$id)[resampled_ids]),],
sequence2 = {idx <- sequence(tabulate(df$id)[resampled_ids], match(unique(df$id), df$id)[resampled_ids])
data.table(id = df$id[idx], X = df$X[idx])})
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> rbindlist 14877.4 15878.30 17181.572 16348.50 18527.6 22520.9 100
#> sequence1 795.0 1016.80 1187.266 1101.95 1326.7 2566.5 100
#> sequence2 386.4 441.75 556.226 473.70 500.9 3373.6 100
Update
To address the comment from jay.sf:
lens <- tabulate(df$id)[resampled_ids]
idx <- sequence(lens, match(unique(df$id), df$id)[resampled_ids])
s <- data.frame(cluster = rep.int(seq_along(resampled_ids), lens), id = df$id[idx], X = df$X[idx])
cluster
corresponds to the index of resampled_ids
.
Upvotes: 1
Reputation: 11738
f = data.frame( id=c(1,1,2,2,2,3,3), X = rnorm(7) )
Try this:
ind_id <- split(seq_along(f$id), f$id)
samp_id <- sample(names(ind_id), replace = TRUE)
f[unlist(ind_id[samp_id]), ]
Upvotes: 0