Given iterable, select each element independently with probability p

Question

I need a function to do the following in Python.

I need to iterate through an input list and choose each element with probability p (which is small).

Here is a naive implementation to make clear what I'm doing.

def f(inputList,p):
    for element in inputList:
        if random.random()



Each random number generation is expensive.  We can do fewer random number generations by calculating initially how long it will take before the first success and then jumping to that entry. I have a method, but wonder if there is something already in existence or a better way to code this.  In principle I just need it for lists, but would like something that works on a general iterable.

def jump_calculation(p):
    if p == 1:
        return 0
    r = random.random()
    k = int(scipy.log(1-r)/scipy.log(1-p))
    return k
def binomial_choice(L,p):
    jump = jump_calculation(p)
    for element in L:
        jump -= 1
        if jump<0:
            yield element
            jump = jump_calculation(p)


Am I reinventing the wheel?  If not, are the obvious improvements to make the code easier to read?

Kolmar · Accepted Answer

The number of Bernoulli trials until the first success is represented by geometric distribution. So you can use it to generate the number of items to skip with numpy.random.geometric:

import itertools
import numpy

def binomial_choice(L, p):
    iterator = iter(L)
    while True:
        to_skip = numpy.random.geometric(p) - 1
        yield next(itertools.islice(iterator, to_skip, None))

This works for any iterators, and you don't have to know the length in advance.

But for Python 3.5+ you'd have to use a more complex version because of PEP 479:

def binomial_choice(L, p):
    iterator = iter(L)
    try:
        while True:
            to_skip = numpy.random.geometric(p) - 1
            yield next(itertools.islice(iterator, to_skip, None))
    except StopIteration:
        return

Usage examples:

In [1]: list(binomial_choice(range(100), 0.05))
Out[1]: [9, 15, 31, 53, 92, 93]

In [2]: list(binomial_choice(range(5), 1))
Out[2]: [0, 1, 2, 3, 4]

The distribution seems to be pretty correct:

In [5]: sum(len(list(binomial_choice(range(100), 0.05))) for i in range(100000)) / 100000
Out[5]: 4.99883

It is also faster than two of your approaches:

In [14]: timeit list(binomial_choice_geometric(range(1000), 0.01))
10000 loops, best of 3: 24.4 µs per loop

In [11]: timeit list(binomial_choice_geometric_3_5(range(1000), 0.01))
10000 loops, best of 3: 42.7 µs per loop

In [12]: timeit list(binomial_choice_jump_calculation(range(1000), 0.01))
1000 loops, best of 3: 596 µs per loop

In [13]: timeit list(binomial_choice_foreach_random(range(1000), 0.01))
1000 loops, best of 3: 203 µs per loop

It actually runs on the scale of the random.sample approach from the other answer (modified with the suggestion from the comment to use numpy.random.binomial to get the correct distribution), but doesn't require to have a list, and to know the len of the argument in advance:

In [19]: timeit list(binomial_choice_random_sample(range(1000), 0.01))
10000 loops, best of 3: 19.8 µs per loop

Given iterable, select each element independently with probability p

Answers (2)

Related Questions