Reputation: 883
I have data that are grouped into blocks, or clusters. I would like to generate a number of bootstrap samples for model evaluation with this data, where the blocks/clusters are sampled with replacement. However, this puts me in a bit of a dilemma when it comes to the analysis portion, because I have repeats of the block/cluster identifier.
For example, say my data looks like this:
set.seed(1)
test <- data.frame(block = rep(1:10, each = 5), matrix(rnorm(150), ncol = 3))
In practice I will be drawing many bootstrap samples, but for didactic purposes let's say I only want a single new dataset, built by randomly selecting block IDs with replacement from the original dataset above:
library(data.table)  # needed for as.data.table(), setkey() and J()
test <- as.data.table(test)
setkey(test, 'block')
random.block <- sample(unique(test$block), size=10, replace=TRUE)
random.sample <- test[J(random.block), allow.cartesian=TRUE]
This works as intended: it creates a new dataset of the same size as the original dataset, but where the blocks have been randomly sampled with replacement.
The problem is this: in the original dataset, each block has only 5 observations (in my real dataset the number of observations per block varies, for the record). In the new dataset each sampled block still has only 5 observations, but because I have sampled with replacement I now have multiple blocks with the same ID number.
In the new dataset, if I try to run any sort of analysis that is stratified by or contingent upon the block identification number (e.g. something as simple as the average of the X variables per block, or more complicated analyses like a mixed model with a random effect on block), it treats the repetitions of a block ID as a single block. So instead of, say, 3 different blocks of size 5, it gives me one block of size 15. This can have profound effects on the analysis, not to mention the interpretation of any results.
The question I have: how could I go about assigning a new unique block ID in my randomly sampled dataset, so that after sampling with replacement each draw of a block has its own identifier and is treated in the final analysis as a separate block rather than as part of a single larger block? I can think of ad hoc ways of doing this (e.g. if each block has the same number of observations), but nothing simple or generalizable.
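To make the collapse concrete, here is a minimal sketch that rebuilds the toy data and counts rows per block ID after resampling. The name `counts` is only for illustration; the exact blocks drawn depend on the seed and R version, so no output is shown.

```r
library(data.table)

set.seed(1)
test <- as.data.table(
  data.frame(block = rep(1:10, each = 5), matrix(rnorm(150), ncol = 3))
)
setkey(test, block)

# Draw 10 block IDs with replacement and pull all rows for each draw
random.block <- sample(unique(test$block), size = 10, replace = TRUE)
random.sample <- test[J(random.block), allow.cartesian = TRUE]

# Rows per block ID: a block drawn k times shows up as one group of 5 * k rows
counts <- random.sample[, .N, by = block]
print(counts)

# Number of draws versus number of groups a per-block analysis would see
length(random.block)
uniqueN(random.sample$block)
```

Any ID with `N` greater than 5 is a block that was drawn more than once and has been merged into a single group.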
Upvotes: 0
Views: 536
Reputation: 681
I think the best way would be to create a data.table with an index based on the key. You can then merge based on the key:
set.seed(1)
test <- data.frame(block = rep(1:10, each = 5), matrix(rnorm(150), ncol = 3))
test
library(data.table)  # needed for as.data.table(), setkey() and J()
test <- as.data.table(test)
setkey(test, 'block')
random.block <- sample(unique(test$block), size=10, replace=TRUE)
random.sample.orig <- test[J(random.block), allow.cartesian=TRUE]
So instead of joining on the bare vector, you create a table that pairs each drawn block with an index id:
rand.tab <- data.table(block = random.block, id = seq_along(random.block))
Then join it with test and overwrite block with the id (if you need to):
random.sample <- test[J(rand.tab), allow.cartesian=TRUE]
random.sample[,block := id]
random.sample[,id := NULL]
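Since the question mentions that real block sizes vary and that many bootstrap samples are needed, the same idea can be wrapped in a small function. This is only a sketch under assumptions: `resample_blocks`, `dat`, `boot.id` and `orig.block` are hypothetical names, the block column is assumed to be called `block`, and the original ID is kept alongside the new one rather than overwritten.

```r
library(data.table)

set.seed(42)
# Toy data with unequal block sizes: 2, 3 and 4 rows
dat <- data.table(block = rep(1:3, times = c(2, 3, 4)), x = rnorm(9))

# Hypothetical helper: one cluster-bootstrap replicate with a unique ID per draw
resample_blocks <- function(dt) {
  ids <- unique(dt$block)
  # Sample positions rather than values, so a single numeric ID is handled safely
  drawn <- ids[sample.int(length(ids), replace = TRUE)]
  draws <- data.table(block = drawn, boot.id = seq_along(drawn))
  # One copy of the block's rows per draw, tagged with that draw's boot.id
  out <- dt[draws, on = "block", allow.cartesian = TRUE]
  setnames(out, "block", "orig.block")
  out[]
}

boot <- resample_blocks(dat)

# Size of each resampled block, alongside the block it was drawn from
sizes <- boot[, .N, by = .(boot.id, orig.block)]
print(sizes)

# Several replicates, returned as a list of data.tables
boots <- replicate(5, resample_blocks(dat), simplify = FALSE)
```

Grouping by `boot.id` then treats every draw as its own block, whatever its size, while `orig.block` still records where the rows came from.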
To check that the rows are the same as in your original version:
all(random.sample$X1 == random.sample.orig$X1 &
    random.sample$X2 == random.sample.orig$X2 &
    random.sample$X3 == random.sample.orig$X3)
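A complementary check, sketched here with the hypothetical names `per.draw` and `same.values`, is to confirm that every draw ends up as its own group of 5 rows and that each group carries the values of the block it was drawn from. It keeps the `id` column instead of overwriting `block`, purely so both IDs are available for the comparison.

```r
library(data.table)

set.seed(1)
test <- as.data.table(
  data.frame(block = rep(1:10, each = 5), matrix(rnorm(150), ncol = 3))
)
setkey(test, block)

random.block <- sample(unique(test$block), size = 10, replace = TRUE)
rand.tab <- data.table(block = random.block, id = seq_along(random.block))
random.sample <- test[rand.tab, on = "block", allow.cartesian = TRUE]

# Each of the 10 draws should form its own group of 5 rows
per.draw <- random.sample[, .N, by = id]

# For draw k, the X1 values should equal those of the source block random.block[k]
same.values <- vapply(seq_along(random.block), function(k) {
  identical(random.sample[id == k, X1], test[block == random.block[k], X1])
}, logical(1))

all(per.draw$N == 5)
all(same.values)
```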
Upvotes: 1