Reputation: 15
I have two lists, both the same size; let's call them elements and weights. I want to choose one element of the elements list with the discrete probability distribution given by weights: weights[i] corresponds to the probability of choosing elements[i]. elements never changes, but after every sample is taken, weights changes (only the values, not the size).
I need an efficient way to do this with large lists.
I have an implementation in Python with numpy.random.choice(elements, p=weights), but taking a sample of size k from a set of size n where k << n is extremely inefficient. An implementation in any language is welcome, but I am working primarily in Python.
(This is used in a social network simulation with networkx. I have a weighted graph and a node i, and I want to choose a node j from i's neighbors, where the probability of each node is proportional to the weight of the edge between i and that node. If I set the probability of non-neighbors to 0, I don't have to generate the list of neighbors every time; I just need a list of all nodes.)
It will be used like this:
elements = [...]
weights = [...]
for ...:                      # simulation loop
    element = sample(elements, weights)
    # some calculation with element, then update the values in weights
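For the graph case, one way the inputs could be built from a networkx graph is sketched below (the graph G, the 'weight' edge attribute, and the node i are illustrative assumptions, not taken from actual code):

import numpy as np
import networkx as nx

G = nx.gnp_random_graph(1000, 0.1, seed=0)        # stand-in for the real weighted graph
for u, v in G.edges():
    G[u][v]['weight'] = np.random.random()        # stand-in edge weights

elements = np.array(list(G.nodes()))              # all nodes; this never changes
i = 0                                             # node whose neighbors we sample from
# edge weight for neighbors of i, probability 0 for every other node
weights = np.array([G[i][j]['weight'] if G.has_edge(i, j) else 0.0
                    for j in elements])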
Upvotes: 1
Views: 368
Reputation: 53029
@MarkBorgerding's approach is good but can be improved:
W = weights.cumsum()                                    # unnormalized cumulative distribution
W.searchsorted(np.random.uniform(0, W[-1], nsamples))   # nsamples indices, drawn in proportion to weights
Also, in the end it depends on the actual numbers, but instead of zeroing the probabilities of non-neighbors it may well be more efficient to remove them entirely; see Timings Part 2 below.
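Wrapped up, the sampler might look like the sketch below (the function names are illustrative; it is just the searchsorted line above plus the neighbor-removal variant from Timings Part 2):

import numpy as np

def sample(elements, weights, nsamples=1):
    # Rebuild the unnormalized CDF on every call, since weights change between samples.
    W = np.asarray(weights).cumsum()
    idx = W.searchsorted(np.random.uniform(0, W[-1], nsamples))
    return np.asarray(elements)[idx]

def sample_from_neighbors(elements, weights, neighbors, nsamples=1):
    # Timings Part 2 variant: drop non-neighbors instead of zeroing their weights.
    nbidx, = np.where(neighbors)                  # indices of actual neighbors
    W = np.asarray(weights)[nbidx].cumsum()
    idx = nbidx[W.searchsorted(np.random.uniform(0, W[-1], nsamples))]
    return np.asarray(elements)[idx]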
Timings:
1000000 options, single sample:
>>> from timeit import timeit
>>> kwds = dict(globals=globals(), number=100)
>>> weights = np.random.random(1000000)
>>>
>>> timeit("np.random.choice(1000000, 1, p=weights/weights.sum())", **kwds)
1.606048938119784
>>> timeit("W = weights.cumsum(); W/=W[-1]; (np.random.uniform()<W).argmax()", **kwds)
0.6634919850621372
>>> timeit("W = weights.cumsum(); W.searchsorted(np.random.uniform(0, W[-1]))", **kwds)
0.30993065400980413
1000000 options, 10 samples:
>>> timeit("np.random.choice(1000000, 10, p=weights/weights.sum())", **kwds)
1.606177378911525
>>> timeit("W = weights.cumsum(); W/=W[-1]; (np.random.uniform(0, 1, (10, 1))<W).argmax(axis=1)", **kwds)
1.4421172500588
>>> timeit("W = weights.cumsum(); W.searchsorted(np.random.uniform(0, W[-1], 10))", **kwds)
0.3154504559934139
Timings part 2:
# assume we connect to 10% of the nodes
>>> neighbors = np.random.random(1000000) < 0.1
>>>
# zeroing non connected weights
>>> timeit("W = np.where(neighbors, weights, 0).cumsum(); W.searchsorted(np.random.uniform(0, W[-1], 10))", **kwds)
0.553992060944438
# outright removing them
>>> timeit("nbidx, = np.where(neighbors); W = weights[nbidx].cumsum(); nbidx[W.searchsorted(np.random.uniform(0, W[-1], 10))]", **kwds)
0.32569816312752664
Upvotes: 1
Reputation: 8476
I've used something like the following. Use cumsum to form the weights into a cumulative distribution function (CDF), and then sample from the inverse CDF.
wcs = weights.cumsum()
wcs = wcs / wcs[-1]                     # normalized CDF, non-decreasing in (0, 1]
u = np.random.uniform()
chosen = elements[(u < wcs).argmax()]   # the first index whose CDF value exceeds u
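A self-contained sketch of the same idea (the function name is illustrative). Note that (u < wcs).argmax() scans the whole array linearly, whereas searchsorted in the answer above does a binary search, which is where its speed advantage in the timings comes from:

import numpy as np

def sample_once(elements, weights):
    wcs = np.asarray(weights).cumsum()
    wcs = wcs / wcs[-1]                    # normalized CDF in (0, 1]
    u = np.random.uniform()
    return elements[(u < wcs).argmax()]    # first element whose CDF value exceeds u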
Upvotes: 0