Reputation: 23
I wish to sample rows of my table with a probability specific to each row.
My table has about 50 million rows and I want to sample 500,000 of them (i.e., 1%). It takes hours to do that. Do you have any idea how to make it more efficient, e.g. by using some C++ package (even though sample and [ both already seem to be written in C)?
The command I use so far:
myTableSample <- myTable[sample(1:dim(myTable)[1], 500000, prob = prob_vector),]
Thanks!
Upvotes: 0
Views: 100
Reputation: 73325
Well, this would be much faster:
ind <- sample.int(dim(myTable)[1], 500000, prob = prob_vector)
ind <- sort(ind)
myTableSample <- myTable[ind, ]
Before sorting, you are accessing rows in a completely random order. After sorting, the accesses move through memory in ascending order, which is much better in terms of CPU cache utilization.
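To see why sorting the indices helps, here is a standalone C++ sketch of the access pattern (not R's internals; the single numeric column and the sizes are illustrative assumptions). The gather loop touches one memory location per selected row, so the order of the indices decides whether those touches are scattered across the whole column or sweep through it front to back:

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Sum the selected entries of one numeric column, visiting them in the
// order given by idx. The arithmetic is identical either way; only the
// memory-access pattern changes.
double gather_sum(const std::vector<double>& column,
                  const std::vector<std::size_t>& idx) {
    double s = 0.0;
    for (std::size_t i : idx)
        s += column[i];  // one (possibly random) memory access per row
    return s;
}
```

With a large `column` and many indices, calling `gather_sum` after `std::sort(idx.begin(), idx.end())` runs noticeably faster than with the indices in random order, because consecutive accesses then fall on the same or neighboring cache lines instead of forcing a cache miss per row. The same effect applies to `myTable[ind, ]` in R.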
Of course, this is not yet the fastest option. You can write this row subsetting in C, and in my experience that is much faster than the built-in [ subsetting.
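As a rough idea of what such C-level row subsetting looks like, here is a hedged C++ sketch (a plain function over a column-major buffer, which is how R stores matrices; it is not the actual `.Call` interface, and `subset_rows` is a hypothetical name):

```cpp
#include <cstddef>
#include <vector>

// Copy the selected rows of a column-major nrow-by-ncol matrix into a new
// rows.size()-by-ncol matrix, also column-major. Row indices are 0-based.
std::vector<double> subset_rows(const std::vector<double>& m,
                                std::size_t nrow, std::size_t ncol,
                                const std::vector<std::size_t>& rows) {
    std::vector<double> out(rows.size() * ncol);
    for (std::size_t j = 0; j < ncol; ++j) {       // one column at a time
        const double* src = m.data() + j * nrow;   // start of column j in m
        double* dst = out.data() + j * rows.size();
        for (std::size_t k = 0; k < rows.size(); ++k)
            dst[k] = src[rows[k]];                 // gather the chosen rows
    }
    return out;
}
```

Going column by column keeps the writes fully sequential, and if `rows` is sorted the reads within each column are ascending too, so both sides of the copy stay cache-friendly. Hooking this up to R would go through `.Call` or Rcpp, which adds boilerplate not shown here.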
Upvotes: 1