Reputation: 381
I want to create a random subset of a data.table, df, that is very large (around 2 million lines). The data table has a weight column, wgt, which indicates how many observations each line represents.
To generate the vector of row numbers I want to extract, I proceed as follows:
I get the number of lines:
ns <- length(df$wgt)
I get the number of desired lines (30% of the sample):
lines <- round(0.3 * ns)
I compute the vector of probabilities:
pr <- df$wgt / sum(df$wgt)
Then I compute the vector of line numbers to get the subsample:
ssout <- sample(1:ns, size = lines, prob = pr)
The final aim is to subset the data using df[ssout, ]. However, R gets stuck when computing ssout.
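For reference, here is a minimal reproducible version of what I'm running, with simulated data standing in for my real table:
library(data.table)

set.seed(1)
# simulated stand-in for my data: 2 million lines with integer weights
df <- data.table(id = 1:2e6, wgt = sample(10, 2e6, replace = TRUE))

ns    <- length(df$wgt)        # number of lines
lines <- round(0.3 * ns)       # 30% of the lines
pr    <- df$wgt / sum(df$wgt)  # selection probability for each line

# this is the call that never finishes for me
ssout <- sample(1:ns, size = lines, prob = pr)
sub   <- df[ssout, ]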
Is there a faster/more efficient way to do this?
Thank you!
Upvotes: 1
Views: 1701
Reputation: 66819
I'm guessing that df is a summary description of a data set that has repeated observations (with wgt being the count of repetitions). In that case, the only useful way to sample from it would be with replacement; and a proper 30% sample would be 30% of the real population, .3*sum(wgt):
# example data
wgt  <- sample(10, 2e6, replace = TRUE)
nobs <- sum(wgt)
pr   <- wgt / sum(wgt)
# select rows
system.time(x <- sample.int(2e6, size = .3*nobs, prob = pr, replace = TRUE))
#   user  system elapsed
#   0.20    0.02    0.22
Sampling rows without replacement takes forever on my computer, but it is also not something one needs to do here.
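If you then want the subsetted data rather than just the row indices, indexing the data.table with the sampled row numbers is fast, e.g. (a sketch, assuming df and wgt as in the question):
pr  <- df$wgt / sum(df$wgt)
idx <- sample.int(nrow(df), size = round(.3 * sum(df$wgt)), prob = pr, replace = TRUE)
sub <- df[idx]  # subset the data.table by the sampled row numbers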
Upvotes: 3