Reputation: 23827
I need a function to do the following in Python.
I need to iterate through an input list and choose each element with probability p (which is small).
Here is a naive implementation to make clear what I'm doing.
def f(inputList,p):
for element in inputList:
if random.random()<p:
yield element
Each random number generation is expensive. We can do fewer random number generations by calculating initially how long it will take before the first success and then jumping to that entry. I have a method, but wonder if there is something already in existence or a better way to code this. In principle I just need it for lists, but would like something that works on a general iterable.
def jump_calculation(p):
if p == 1:
return 0
r = random.random()
k = int(scipy.log(1-r)/scipy.log(1-p))
return k
def binomial_choice(L,p):
jump = jump_calculation(p)
for element in L:
jump -= 1
if jump<0:
yield element
jump = jump_calculation(p)
Am I reinventing the wheel? If not, are the obvious improvements to make the code easier to read?
Upvotes: 4
Views: 377
Reputation: 14224
The number of Bernoulli trials until the first success is represented by geometric distribution. So you can use it to generate the number of items to skip with numpy.random.geometric
:
import itertools
import numpy
def binomial_choice(L, p):
iterator = iter(L)
while True:
to_skip = numpy.random.geometric(p) - 1
yield next(itertools.islice(iterator, to_skip, None))
This works for any iterators, and you don't have to know the length in advance.
But for Python 3.5+ you'd have to use a more complex version because of PEP 479:
def binomial_choice(L, p):
iterator = iter(L)
try:
while True:
to_skip = numpy.random.geometric(p) - 1
yield next(itertools.islice(iterator, to_skip, None))
except StopIteration:
return
Usage examples:
In [1]: list(binomial_choice(range(100), 0.05))
Out[1]: [9, 15, 31, 53, 92, 93]
In [2]: list(binomial_choice(range(5), 1))
Out[2]: [0, 1, 2, 3, 4]
The distribution seems to be pretty correct:
In [5]: sum(len(list(binomial_choice(range(100), 0.05))) for i in range(100000)) / 100000
Out[5]: 4.99883
It is also faster than two of your approaches:
In [14]: timeit list(binomial_choice_geometric(range(1000), 0.01))
10000 loops, best of 3: 24.4 µs per loop
In [11]: timeit list(binomial_choice_geometric_3_5(range(1000), 0.01))
10000 loops, best of 3: 42.7 µs per loop
In [12]: timeit list(binomial_choice_jump_calculation(range(1000), 0.01))
1000 loops, best of 3: 596 µs per loop
In [13]: timeit list(binomial_choice_foreach_random(range(1000), 0.01))
1000 loops, best of 3: 203 µs per loop
It actually runs on the scale of the random.sample
approach from the other answer (modified with the suggestion from the comment to use numpy.random.binomial
to get the correct distribution), but doesn't require to have a list, and to know the len
of the argument in advance:
In [19]: timeit list(binomial_choice_random_sample(range(1000), 0.01))
10000 loops, best of 3: 19.8 µs per loop
Upvotes: 4
Reputation: 1962
Not sure if this is what you want, but.. You could use random.sample()
to take a random sample from a list. It has an argument which specifies the sample size, and you can compute this size based on the list's length. I mean, if the probability is small, then the sample size is small.
from random import sample
a = range(100)
probability = 0.5
max_sz = int(len(a) * probability)
sz = randint(0, max_sz)
print sample(a, sz)
# [34, 81, 58, 52, 9, 86, 57, 29, 3, 99]
P.S. Oh, I just noticed that the idea was already introduced in comments, and that you want to be able to work with unknown iterable size. Sorry. Still, I'll leave it here.
Upvotes: 1