Alex_H

Reputation: 23

R: sampling rows out of a huge table with row-specific probabilities

I wish to sample rows of my table with a probability specific to each row.

My table has about 50 million rows and I wish to sample 500,000 of those (i.e., 1%). It takes hours to do that. Do you have any idea how to make it more efficient, for example by using some C++ package (even though `sample` and `[` both already seem to be written in C)?

The command I use so far:

myTableSample <- myTable[sample(1:dim(myTable)[1], 500000, prob = prob_vector),]

Thanks!

Upvotes: 0

Views: 100

Answers (1)

Zheyuan Li

Reputation: 73325

Well, this would be much faster:

ind <- sample.int(dim(myTable)[1], 500000, prob = prob_vector)
ind <- sort(ind)
myTableSample <- myTable[ind, ]

Without sorting, the subsetting does completely random access into the table. After sorting the indices, the rows are read in increasing order, which is much better in terms of CPU cache utilization.
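To see the effect, you can time the two versions yourself. This is a scaled-down sketch (5 million rows instead of the question's 50 million, so it runs quickly); the table and probability vector here are synthetic stand-ins for the poster's data:

```r
n <- 5e6   # scaled-down stand-in for the 50M-row table
k <- 5e4   # scaled-down sample size
myTable <- data.frame(x = runif(n), y = runif(n))
prob_vector <- runif(n)

ind <- sample.int(n, k, prob = prob_vector)

# random-order row access
t_random <- system.time(a <- myTable[ind, ])

# sorted (sequential) row access
t_sorted <- system.time(b <- myTable[sort(ind), ])

t_random
t_sorted
```

Both versions return the same rows (in different order); the difference is purely in access pattern.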

Of course this is still not the fastest. You can write this row subsetting in C, and based on my previous experience it is much faster than `[`.
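As a rough illustration of the C route (not the answerer's actual code), here is a sketch using the Rcpp package, assuming the data can be held in a numeric matrix; `subset_rows` is a hypothetical helper, and a real data frame with mixed column types would need a more general implementation:

```r
library(Rcpp)

# Compile a small C++ row-subsetting routine.
cppFunction('
NumericMatrix subset_rows(NumericMatrix x, IntegerVector ind) {
  // ind uses 1-based R indexing; sorting it first keeps access cache-friendly
  NumericMatrix out(ind.size(), x.ncol());
  for (int i = 0; i < ind.size(); i++) {
    out(i, _) = x(ind[i] - 1, _);
  }
  return out;
}')

m <- matrix(runif(1e6), ncol = 10)
ind <- sort(sample.int(nrow(m), 1e4))
sub <- subset_rows(m, ind)
```

Whether this beats `[` in practice depends on the column types and on how much copying `[` has to do for your particular table.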

Upvotes: 1
