Reputation: 3723
I'd like to draw a sample of items from a list, but I want to set the probability each item is included, not the total number of items to draw (so random.sample() does not work). I get the effect I want with the following code (where p is probability of inclusion, and items is the list of things):
[item for item in items if random.random() < p]
But it is very slow. Any suggestions for speeding it up?
The list is up to 10 million items long and single typed (all integers), so maybe there's a numpy / pandas solution to this?
Thanks!
Nick
Upvotes: 2
Views: 1621
Reputation: 882023
The number of items in your resulting sample (n
attempts each independently with probability p
) has a binomial distribution and thus can rapidly be randomly generated e.g with numpy
:
sample_size = numpy.random.binomial(len(population). p)
Now, the_sample = random.sample(population, sample_size)
gives you exactly what you desire -- the equivalent of randomly, independently picking each item in the population with the same probability p
.
This is based on your example code which you say is too slow but also say it's otherwise OK -- i.e, the same p
for every item in the population. If each item has a totally different p
, this can't work (if there are a few different values of p
it can work by stratified sampling -- segment the population into uniform ones, each sub-population with a single value of p
, and get samples from each of them independently, then union them).
Upvotes: 4