Fast, independent random draws / sample from a list in python - fixed probability, not total number

Question

I'd like to draw a sample of items from a list, but I want to set the probability each item is included, not the total number of items to draw (so random.sample() does not work). I get the effect I want with the following code (where p is probability of inclusion, and items is the list of things):

[item for item in items if random.random() < p]

But it is very slow. Any suggestions for speeding it up?

The list is up to 10 million items long and single typed (all integers), so maybe there's a numpy / pandas solution to this?

Thanks!

Nick

Alex Martelli · Accepted Answer

The number of items in your resulting sample (n attempts each independently with probability p) has a binomial distribution and thus can rapidly be randomly generated e.g with numpy:

sample_size = numpy.random.binomial(len(population). p)

Now, the_sample = random.sample(population, sample_size) gives you exactly what you desire -- the equivalent of randomly, independently picking each item in the population with the same probability p.

This is based on your example code which you say is too slow but also say it's otherwise OK -- i.e, the same p for every item in the population. If each item has a totally different p, this can't work (if there are a few different values of p it can work by stratified sampling -- segment the population into uniform ones, each sub-population with a single value of p, and get samples from each of them independently, then union them).

Fast, independent random draws / sample from a list in python - fixed probability, not total number

Answers (1)

Related Questions