Doon_Bogan

Reputation: 381

sample() command is too slow in R

I want to create a random subset of a very large data.table, df (around 2 million rows). The table has a weight column, wgt, that indicates how many observations each row represents. To generate the vector of row numbers I want to extract, I proceed as follows:

First, I get the number of rows:

ns <- length(df$wgt)

I get the number of desired lines (30% of the sample):

lines <- round(0.3 * ns)

I compute the vector of probabilities:

pr <- df$wgt / sum(df$wgt)

And then I compute the vector of line numbers to get the subsample:

ssout <- sample(1:ns, size = lines, prob = pr)

The final aim is to subset the data using df[ssout,]. However, R gets stuck when computing ssout.
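Put together, a small self-contained version of these steps looks roughly like the sketch below (the id column, the seed, and the much smaller row count are placeholders so the snippet finishes quickly; with the real ~2 million rows, the sample() call is the step that stalls):

library(data.table)

set.seed(1)
df <- data.table(id = seq_len(1e4), wgt = sample(10, 1e4, replace = TRUE))

ns    <- length(df$wgt)          # number of rows in df
lines <- round(0.3 * ns)         # 30% of the rows
pr    <- df$wgt / sum(df$wgt)    # per-row selection probabilities

# weighted sampling WITHOUT replacement -- quick at 1e4 rows,
# effectively hangs at 2e6 rows
ssout <- sample(1:ns, size = lines, prob = pr)

sub <- df[ssout, ]               # the intended subset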

Is there a faster/more efficient way to do this?

Thank you!

Upvotes: 1

Views: 1701

Answers (1)

Frank

Reputation: 66819

I'm guessing that df is a summary description of a data set that has repeated observations (with wgt being the count of repetitions). In that case, the only useful way to sample from it would be with replacement; and a proper 30% sample would be 30% of the real population, .3*sum(wgt):

# example data
wgt  <- sample(10, 2e6, replace = TRUE)
nobs <- sum(wgt)
pr   <- wgt / sum(wgt)

# select rows
system.time(x <- sample.int(2e6, size = .3*nobs, prob = pr, replace = TRUE))
#    user  system elapsed 
#    0.20    0.02    0.22

Sampling rows without replacement takes forever on my computer, and it isn't something one needs to do here anyway.
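For completeness, the sampled row numbers x can be fed straight back into data.table subsetting. A minimal sketch, assuming the wgt vector above belongs to a data.table named df (the id column is only a placeholder):

library(data.table)

# assumed setup: df holds the wgt vector generated above
df <- data.table(id = seq_len(2e6), wgt = wgt)

samp <- df[x]          # rows may repeat, since sampling was with replacement
nrow(samp) / nobs      # roughly 0.3, i.e. 30% of the underlying population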

Upvotes: 3
