naroslife

Reputation: 15

Draw small sample from large set with discrete distribution efficiently

I have two lists of the same size, call them elements and weights. I want to choose one element of the elements list with the discrete probability distribution given by weights: weights[i] is the probability of choosing elements[i]. elements never changes, but after every sample taken, the values in weights change (the size stays the same).

I need an efficient way to do this with large lists.

I have a Python implementation using numpy.random.choice(elements, p=weights), but it is extremely inefficient for taking a sample of size k from a set of size n when k << n: each call does work proportional to n. An implementation in any language is welcome, but I am working primarily in Python.

(This is used in a social network simulation with networkx. I have a weighted graph and a node i, and I want to choose a node j from i's neighbors, with probability proportional to the weight of the edge between i and j. If I set the probability of non-neighbors to 0, I don't have to generate the list of neighbors every time; I just need a list of all nodes.)
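A minimal sketch of that setup, assuming a weighted networkx graph G (the graph, the node i, and the 'weight' attribute name are illustrative):

import networkx as nx
import numpy as np

G = nx.Graph()
G.add_weighted_edges_from([(0, 1, 2.0), (0, 2, 0.5), (1, 2, 1.0), (2, 3, 3.0)])

elements = list(G.nodes())  # fixed list of all nodes; never changes
i = 0                       # the node whose neighbor we want to sample

# Probability proportional to the edge weight for neighbors, 0 otherwise.
weights = np.array([G[i][j]['weight'] if G.has_edge(i, j) else 0.0
                    for j in elements])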

It will be used like this:

elements = [...]
weights = [...]

for ...:
    element = sample(elements, weights)
    # some calculation with element, then update the values in weights

Upvotes: 1

Views: 368

Answers (2)

Paul Panzer

Reputation: 53029

@MarkBorgerding's approach is good but can be improved:

W = weights.cumsum()                                   # unnormalized CDF
W.searchsorted(np.random.uniform(0, W[-1], nsamples))  # invert it by binary search
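Wrapped into a self-contained helper for clarity (a sketch; the function name and the demo values are mine, not part of the answer):

import numpy as np

def sample(elements, weights, nsamples=1):
    # Inverse-CDF sampling: build the unnormalized cumulative sum once,
    # then binary-search nsamples uniforms drawn from [0, total weight).
    W = np.asarray(weights, dtype=float).cumsum()
    u = np.random.uniform(0, W[-1], nsamples)
    return np.asarray(elements)[W.searchsorted(u)]

# e.g. sample(['a', 'b', 'c'], [0.1, 0.1, 0.8], nsamples=5)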

Also, in the end it depends on the actual numbers, but instead of zeroing the probabilities of non-neighbors it may well be more efficient to remove those entries outright; see Timings Part 2 below.

Timings:

1000000 options, single sample:

>>> from timeit import timeit
>>> kwds = dict(globals=globals(), number=100)
>>> weights = np.random.random(1000000)
>>> 
>>> timeit("np.random.choice(1000000, 1, p=weights/weights.sum())", **kwds)
1.606048938119784
>>> timeit("W = weights.cumsum(); W/=W[-1]; (np.random.uniform()<W).argmax()", **kwds)
0.6634919850621372
>>> timeit("W = weights.cumsum(); W.searchsorted(np.random.uniform(0, W[-1]))", **kwds)
0.30993065400980413

1000000 options, 10 samples:

>>> timeit("np.random.choice(1000000, 10, p=weights/weights.sum())", **kwds)
1.606177378911525
>>> timeit("W = weights.cumsum(); W/=W[-1]; (np.random.uniform(0, 1, (10, 1))<W).argmax(axis=1)", **kwds)
1.4421172500588
>>> timeit("W = weights.cumsum(); W.searchsorted(np.random.uniform(0, W[-1], 10))", **kwds)
0.3154504559934139

Timings Part 2:

# assume we connect to 10% of the nodes
>>> neighbors = np.random.random(1000000) < 0.1
>>> 
# zeroing non connected weights
>>> timeit("W = np.where(neighbors, weights, 0).cumsum(); W.searchsorted(np.random.uniform(0, W[-1], 10))", **kwds)
0.553992060944438
# outright removing them
>>> timeit("nbidx, = np.where(neighbors); W = weights[nbidx].cumsum(); nbidx[W.searchsorted(np.random.uniform(0, W[-1], 10))]", **kwds)
0.32569816312752664
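
For readability, the same two variants unpacked from the timeit strings (a sketch; the function names are mine):

import numpy as np

def sample_zeroed(weights, neighbors, nsamples):
    # Zero out non-neighbor weights, then invert the full-length CDF.
    W = np.where(neighbors, weights, 0).cumsum()
    return W.searchsorted(np.random.uniform(0, W[-1], nsamples))

def sample_compact(weights, neighbors, nsamples):
    # Drop non-neighbors entirely: build a shorter CDF over neighbors only,
    # then map the compact indices back to the original node indices.
    nbidx, = np.where(neighbors)
    W = weights[nbidx].cumsum()
    return nbidx[W.searchsorted(np.random.uniform(0, W[-1], nsamples))]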

Upvotes: 1

Mark Borgerding

Reputation: 8476

I've used something like the following. Use cumsum to form the weights into a cumulative distribution function (CDF), then sample by inverting the CDF.

wcs = weights.cumsum()
wcs = wcs / wcs[-1]  # non-decreasing CDF in (0, 1]
u = np.random.uniform()
chosen = elements[(u < wcs).argmax()]  # the first index whose CDF value exceeds u
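
A runnable demo of the same idea (the element values and weights are made up):

import numpy as np

elements = np.array(['a', 'b', 'c', 'd'])
weights = np.array([1.0, 2.0, 3.0, 4.0])

wcs = weights.cumsum()
wcs = wcs / wcs[-1]                  # CDF: non-decreasing, ends at 1
u = np.random.uniform()
print(elements[(u < wcs).argmax()])  # first index where the CDF exceeds u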

Upvotes: 0
