Reputation: 1931
I want to draw clusters (defined by the variable id
) with replacement from a dataset, and in contrast to previously answered questions, I want clusters that are chosen K times to have each observation repeated K times. That is, I'm doing cluster bootstrapping.
For example, the following samples id=1
twice, but repeats the observations for id=1
only once in the new dataset s
. I want all observations from id=1
to appear twice.
f <- data.frame(id=c(1, 1, 2, 2, 2, 3, 3), X=rnorm(7))
set.seed(451)
new.ids <- sample(unique(f$id), replace=TRUE)
s <- f[f$id %in% new.ids, ]
Upvotes: 2
Views: 570
Reputation: 14370
One option would be to lapply
over each new.id
and save it in a list. Then you can stack that all together:
library(data.table)
rbindlist(lapply(new.ids, function(x) f[f$id %in% x,]))
# id X
#1: 1 1.20118333
#2: 1 -0.01280538
#3: 1 1.20118333
#4: 1 -0.01280538
#5: 3 -0.07302158
#6: 3 -1.26409125
Upvotes: 3
Reputation: 2738
Just in case one would need to have a "new_id" that corresponded to the index number (i.e. sample order) -- (I needed to have "new_id" so that i could run mixed effects models without having several instances of a cluster treated as one cluster because they shared the same id):
library(data.table)
f = data.frame( id=c(1,1,2,2,2,3,3), X = rnorm(7) )
set.seed(451); new.ids = sample( unique(f$id), replace=TRUE )
## ss has unique valued `new_id` for each cluster
ss = rbindlist(mapply(function(x, index) cbind(f[f$id %in% x,], new_id=index),
new.ids,
seq_along(new.ids),
SIMPLIFY=FALSE
))
ss
which gives:
> ss
id X new_id
1: 1 -0.3491670 1
2: 1 1.3676636 1
3: 1 -0.3491670 2
4: 1 1.3676636 2
5: 3 0.9051575 3
6: 3 -0.5082386 3
Note the values of X are different because set.seed is not set before the rnorm()
call, but the id is the same as the answer of @Mike H.
This link was useful to me in constructing this answer: R lapply statement with index [duplicate]
Upvotes: 1