Reputation: 219
I'm trying to sample 1000 numbers between 0 and 999, with a vector of weights dictating the probability that a particular number will be chosen:
import numpy as np
resampled_indices = np.random.choice(a = 1000, size = 1000, replace = True, p = weights)
Unfortunately, this process has to be run thousands of times in a larger for loop, and np.random.choice seems to be the main speed bottleneck. I was wondering if there's any way to speed up np.random.choice, or an alternative method that gives the same results.
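For reference, here's a minimal, runnable sketch of the kind of loop I mean (the weights and iteration count are just placeholders, not my real data):
import numpy as np

# placeholder weights: any non-negative vector over 0..999 that sums to 1
weights = np.random.rand(1000)
weights /= weights.sum()

for _ in range(10000):  # stands in for the larger outer loop
    resampled_indices = np.random.choice(a=1000, size=1000, replace=True, p=weights)
    # ... per-iteration work using resampled_indices ...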
Upvotes: 8
Views: 6962
Reputation: 516
For smaller sample sizes I find that the Python 3.6 function random.choices is faster. In the script below, the break-even point is at a sample size of 99, and random.choices becomes increasingly faster than numpy.random.choice as the sample size decreases. With no weights, the break-even number is slightly higher, at 120. However, for a population size of 1000, random.choices is about 3 times slower with weights, and 7 times slower without weights.
import numpy as np
import time
import random

SIZE = 98

def numpy_choice():
    # draw a weighted sample of SIZE elements 10000 times with numpy
    for count in range(10000):
        resampled_indices = np.random.choice(a=population_array, size=SIZE, replace=True, p=weights)
    return

def python_choices():
    # draw the same weighted sample 10000 times with the stdlib
    for count in range(10000):
        resampled_indices = random.choices(population_list, weights=weights, k=SIZE)
    return

if __name__ == '__main__':
    weights = [1/SIZE for i in range(SIZE)]
    population_array = np.arange(SIZE)
    population_list = list(population_array)

    start_time = time.time()
    numpy_choice()
    end_time = time.time()
    print('numpy.choice time:', end_time - start_time)
    # gave 0.299

    start_time = time.time()
    python_choices()
    end_time = time.time()
    print('python random.choices time:', end_time - start_time)
    # gave 0.296
Upvotes: 0
Reputation: 1235
It seems you can go slightly faster by drawing uniform samples and then "inverting" the cumulative distribution with np.searchsorted:
# assume arbitrary probabilities
weights = np.random.randn(1000)**2
weights /= weights.sum()
def weighted_random(w, n):
    cumsum = np.cumsum(w)
    rdm_unif = np.random.rand(n)
    return np.searchsorted(cumsum, rdm_unif)
# first method
%timeit np.random.choice(a = 1000, size = 1000, replace = True, p = weights)
# 10000 loops, best of 3: 220 µs per loop
# proposed method
%timeit weighted_random(weights, 1000)
# 10000 loops, best of 3: 158 µs per loop
Now we can check empirically that the probabilities are correct:
samples = np.empty((10000, 1000), dtype=int)
for i in range(10000):
    samples[i, :] = weighted_random(weights, 1000)
empirical = 1. * np.bincount(samples.flatten()) / samples.size
((empirical - weights)**2).max()
# 3.5e-09
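If the same weights are reused on every pass of the outer loop mentioned in the question (an assumption; the question doesn't say), the cumulative sum only needs to be computed once, along these lines:
import numpy as np

weights = np.random.rand(1000)   # placeholder weights, not the real ones
weights /= weights.sum()

cumsum = np.cumsum(weights)      # computed once, outside the loop

for _ in range(10000):           # stands in for the larger outer loop
    resampled_indices = np.searchsorted(cumsum, np.random.rand(1000))
    # ... per-iteration work using resampled_indices ...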
Upvotes: 4