Reputation: 381
I want to create a random subset of a data.table, df, that is very large (around 2 million lines). The data table has a weight column, wgt, which indicates how many observations each line represents.
To generate the vector of row numbers I want to extract, I proceed as follows:
I get the number of lines:
ns <- length(df$wgt)
I get the number of desired lines (30% of the sample):
lines <- round(0.3 * ns)
I compute the vector of probabilities:
pr <- df$wgt / sum(df$wgt)
Then I compute the vector of line numbers to get the subsample:
ssout <- sample(1:ns, size = lines, prob = pr)
The final aim is to subset the data using df[ssout, ]. However, R gets stuck when computing ssout.
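For reference, here is a minimal reproducible version of what I'm running, with simulated data standing in for my real table:
library(data.table)

set.seed(1)
# simulated stand-in for my data: 2 million lines with integer weights
df <- data.table(id = 1:2e6, wgt = sample(10, 2e6, replace = TRUE))

ns    <- length(df$wgt)        # number of lines
lines <- round(0.3 * ns)       # 30% of the lines
pr    <- df$wgt / sum(df$wgt)  # selection probability for each line

# this is the call that never finishes for me
ssout <- sample(1:ns, size = lines, prob = pr)
sub   <- df[ssout, ]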
Is there a faster/more efficient way to do this?
Thank you!
Upvotes: 1
Views: 1701
Reputation: 66819
I'm guessing that df is a summary description of a data set that has repeated observations (with wgt being the count of repetitions). In that case, the only useful way to sample from it would be with replacement; and a proper 30% sample would be 30% of the real population, .3*sum(wgt):
# example data
wgt  <- sample(10, 2e6, replace = TRUE)
nobs <- sum(wgt)
pr   <- wgt / sum(wgt)
# select rows
system.time(x <- sample.int(2e6, size = .3*nobs, prob = pr, replace = TRUE))
#   user  system elapsed
#   0.20    0.02    0.22
Sampling rows without replacement takes forever on my computer, but it is also not something one needs to do here.
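If you then want the subsetted data rather than just the row indices, indexing the data.table with the sampled row numbers is fast, e.g. (a sketch, assuming df and wgt as in the question):
pr  <- df$wgt / sum(df$wgt)
idx <- sample.int(nrow(df), size = round(.3 * sum(df$wgt)), prob = pr, replace = TRUE)
sub <- df[idx]  # subset the data.table by the sampled row numbers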
Upvotes: 3