Reputation: 155
I have a integer that needs to be split up in to bins according to a probability distribution. For example, if I had N=100
objects going into [0.02, 0.08, 0.16, 0.29, 0.45]
then you might get [1, 10, 20, 25, 44]
.
import numpy as np
# sample distribution
d = np.array([x ** 2 for x in range(1,6)], dtype=float)
d = d / d.sum()
dcs = d.cumsum()
bins = np.zeros(d.shape)
N = 100
for roll in np.random.rand(N):
# grab the first index that the roll satisfies
i = np.where(roll < dcs)[0][0]
bins[i] += 1
In reality, N and my number of bins are very large, so looping isn't really a viable option. Is there any way I can vectorize this operation to speed it up?
Upvotes: 6
Views: 1810
Reputation: 143194
One more way to do this:
import numpy as np
p = [0.02, 0.08, 0.16, 0.29, 0.45]
np.bincount(np.random.choice(range(len(p)), size=100, p=p), minlength=len(p))
# array([ 1, 6, 16, 25, 52])
It seems like allocating a length-100 array should be unnecessary but I haven't seen a way in numpy to avoid it.
Upvotes: 0
Reputation: 221624
You can use np.bincount
for a binning operation alongwith np.searchsorted
to perform the equivalent of roll < dcs
operation. Here's an implementation to fulfill these promises -
bins = np.bincount(np.searchsorted(dcs,np.random.rand(N),'right'))
Runtime test using given parameters -
In [72]: %%timeit
...: for roll in np.random.rand(N):
...: # grab the first index that the roll satisfies
...: i = np.where(roll < dcs)[0][0]
...: bins[i] += 1
...:
1000 loops, best of 3: 721 µs per loop
In [73]: %%timeit
...: np.bincount(np.searchsorted(dcs,np.random.rand(N),'right'))
...:
100000 loops, best of 3: 13.5 µs per loop
Upvotes: 2
Reputation: 74222
You can convert your PDF to a CDF by taking the cumsum, use this to define a set of bins between 0 and 1, then use these bins to compute the histogram of an N-long random uniform vector:
cdf = np.cumsum([0, 0.02, 0.08, 0.16, 0.29, 0.45]) # leftmost bin edge = 0
counts, edges = np.histogram(np.random.rand(100), bins=cdf)
print(counts)
# [ 4, 8, 16, 30, 42]
Upvotes: 5