Reputation: 23
I wish to sample rows of my table with a probability specific to each row.
My table has about 50 million rows and I want to sample 500,000 of them (i.e., 1%). It takes hours to do that. Do you have any idea how to make it more efficient, e.g. by using some C++ package (even though sample and [ both already seem to be written in C)?
The command I use so far:
myTableSample <- myTable[sample(1:dim(myTable)[1], 500000, prob = prob_vector),]
Thanks!
Upvotes: 0
Views: 100
Reputation: 73325
Well, this would be much faster:
ind <- sample.int(dim(myTable)[1], 500000, prob = prob_vector)
ind <- sort(ind)
myTableSample <- myTable[ind, ]
Before sorting, you are accessing rows in a completely random order. After sorting, the accesses move through memory in ascending order, which is much better in terms of CPU cache utilization.
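To see why sorting the indices helps, here is a standalone C++ sketch of the access pattern (not R's internals; the single numeric column and the sizes are illustrative assumptions). The gather loop touches one memory location per selected row, so the order of the indices decides whether those touches are scattered across the whole column or sweep through it front to back:

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Sum the selected entries of one numeric column, visiting them in the
// order given by idx. The arithmetic is identical either way; only the
// memory-access pattern changes.
double gather_sum(const std::vector<double>& column,
                  const std::vector<std::size_t>& idx) {
    double s = 0.0;
    for (std::size_t i : idx)
        s += column[i];  // one (possibly random) memory access per row
    return s;
}
```

With a large `column` and many indices, calling `gather_sum` after `std::sort(idx.begin(), idx.end())` runs noticeably faster than with the indices in random order, because consecutive accesses then fall on the same or neighboring cache lines instead of forcing a cache miss per row. The same effect applies to `myTable[ind, ]` in R.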
Of course, this is not yet the fastest option. You can write this row subsetting in C, and in my experience that is much faster than the built-in [ subsetting.
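As a rough idea of what such C-level row subsetting looks like, here is a hedged C++ sketch (a plain function over a column-major buffer, which is how R stores matrices; it is not the actual `.Call` interface, and `subset_rows` is a hypothetical name):

```cpp
#include <cstddef>
#include <vector>

// Copy the selected rows of a column-major nrow-by-ncol matrix into a new
// rows.size()-by-ncol matrix, also column-major. Row indices are 0-based.
std::vector<double> subset_rows(const std::vector<double>& m,
                                std::size_t nrow, std::size_t ncol,
                                const std::vector<std::size_t>& rows) {
    std::vector<double> out(rows.size() * ncol);
    for (std::size_t j = 0; j < ncol; ++j) {       // one column at a time
        const double* src = m.data() + j * nrow;   // start of column j in m
        double* dst = out.data() + j * rows.size();
        for (std::size_t k = 0; k < rows.size(); ++k)
            dst[k] = src[rows[k]];                 // gather the chosen rows
    }
    return out;
}
```

Going column by column keeps the writes fully sequential, and if `rows` is sorted the reads within each column are ascending too, so both sides of the copy stay cache-friendly. Hooking this up to R would go through `.Call` or Rcpp, which adds boilerplate not shown here.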
Upvotes: 1