Austin Downey
Austin Downey

Reputation: 1065

Weighted random sample without replacement in python

I need to obtain a k-sized sample without replacement from a population, where each member of the population has a associated weight (W).

Numpy's random.choices will not perform this task without replacement, and random.sample won't take a weighted input.

Currently, this is what I am using:

P = np.zeros((1,Parent_number))
n=0
while n < Parent_number:
    draw = random.choices(population,weights=W,k=1)
    if draw not in P:
        P[0,n] = draw[0]
        n=n+1
P=np.asarray(sorted(P[0])) 

While this works, it reqires switching back and forth from arrays, to lists and back to arrays and is, therefore, less than ideal.

I am looking for the simplest and easiest to understand solution as this code will be shared with others.

Upvotes: 31

Views: 24029

Answers (3)

Todd
Todd

Reputation: 5415

numpy is likely the best option. But here's another pure Python solution for weighted samples without replacement.

There are a couple ways to define the purpose of the parameters for population and weights. population can be defined to represent the total population of items, and weights a list of biases that influence selection. For instance, in a horse race simulation, population could be the horses - each unique with a name, and weights their performance ratings. The functions below follow this model.

from random import random
from bisect import bisect_left
from itertools import accumulate

def wsample(population, weights, k=1):
    wts   = list(weights)
    sampl = []
    rnums = [random() for _ in range(k)]
    for r in rnums:
        acm_wts = list(accumulate(wts))
        total   = acm_wts[-1]
        i       = bisect_left(acm_wts, total * r)
        p       = population[i]
        wts[i]  = 0
        sampl.append(p)
    return sampl

Selected individuals are effectively removed from further selections by setting their weight to 0, and recalculating the accumulated weights. If using this, ensure k <= len(population).

The first version provides a good point of reference for testing this second version. The version below is very fast compared to the first.

In this next version, the accumulated weights are computed once, and collisions in the sampling incur retries. This has the effect of removing ranges from the possible selections, while the ranges that still haven't been taken hold bands relatively proportioned to the other active bands to keep the correct probabilities of selection in play.

A dictionary keyed on selected indices ensures each selected member is a unique individual. The dict retains the order the items are added and returns them in the order of selection.

The idea seems to work. The outcomes under testing compare very closely between these two implementations.

def wsample(population, weights, k=1):
    accum = list(accumulate(weights))
    total = accum[-1]
    sampl = {}
    while len(sampl) < k:
        index        = bisect_left(accum, total * random())
        sampl[index] = population[index]
    return list(sampl.values())

Despite the fact that the chances for extra looping more than k times are high (depending on the parameters) each selection, the elimination of the O(n) accumulate() operation each iteration more than makes up for it in faster execution times. This could be made even faster if it required the weights to be pre-accumulated, but for my application these need to be calculated each cycle once anyway.

To use this, one may want to put in a guard against infinite looping if it's possible in any application that uses it. And possibly put in a check or two to ensure the parameters are as expected for it to work.

In the tests below, the population consists of 10,000 items with the same corresponding randomly generated weights. This was run on a VM hosted on a computer over 10 years old - anyone can get better results than this, but it shows the relative speeds of the two approaches.

First version:

timeit.timeit("wsample(population, weights, k=5)", globals=globals(), number=10**4)
21.74719240899867

Second version:

timeit.timeit("wsample(population, weights, k=5)", globals=globals(), number=10**4)
4.32836378099455

Second version modified for weights pre-accumulated:

timeit.timeit("wsample(population, acm_weights, k=5)", globals=globals(), number=10**4)
0.05602245099726133

Upvotes: 3

Raymond Hettinger
Raymond Hettinger

Reputation: 226754

Built-in solution

As suggested by Miriam Farber, you can just use the numpy's builtin solution:

np.random.choice(vec,size,replace=False, p=P)

Pure python equivalent

What follows is close to what numpy does internally. It, of course, uses numpy arrays and numpy.random.choices():

from random import choices

def weighted_sample_without_replacement(population, weights, k=1):
    weights = list(weights)
    positions = range(len(population))
    indices = []
    while True:
        needed = k - len(indices)
        if not needed:
            break
        for i in choices(positions, weights, k=needed):
            if weights[i]:
                weights[i] = 0.0
                indices.append(i)
    return [population[i] for i in indices]

Related problem: Selection when elements can be repeated

This is sometimes called an urn problem. For example, given an urn with 10 red balls, 4 white balls, and 18 green balls, choose nine balls without replacement.

To do it with numpy, generate the unique selections from the total population count with sample(). Then, bisect the cumulative weights to get the population indices.

import numpy as np
from random import sample

population = np.array(['red', 'blue', 'green'])
counts = np.array([10, 4, 18])
k = 9

cum_counts = np.add.accumulate(counts)
total = cum_counts[-1]
selections = sample(range(total), k=k)
indices = np.searchsorted(cum_counts, selections, side='right')
result = population[indices]

To do this without *numpy', the same approach can be implemented with bisect() and accumulate() from the standard library:

from random import sample
from bisect import bisect
from itertools import accumulate

population = ['red', 'blue', 'green']
weights = [10, 4, 18]
k = 9

cum_weights = list(accumulate(weights))
total = cum_weights.pop()
selections = sample(range(total), k=k)
indices = [bisect(cum_weights, s) for s in selections]
result = [population[i] for i in indices]

Upvotes: 19

Miriam Farber
Miriam Farber

Reputation: 19664

You can use np.random.choice with replace=False as follows:

np.random.choice(vec,size,replace=False, p=P)

where vec is your population and P is the weight vector.

For example:

import numpy as np
vec=[1,2,3]
P=[0.5,0.2,0.3]
np.random.choice(vec,size=2,replace=False, p=P)

Upvotes: 38

Related Questions