Reputation: 189
The numpy.random.choice method can generate a random sample without replacement if different elements should have different probabilities. However, when I test it with
import numpy
a = [0, 1, 2, 3, 4, 5]
p = [0.1, 0.3, 0.3, 0.1, 0.1, 0.1]
result = [0, 0, 0, 0, 0, 0]
N = 1000000
k = 3
for i in range(0, N):
temp = numpy.random.choice(a, k, False, p)
for j in temp:
result[j] += 1
for i in range(0, 6):
result[i] /= (N * k)
print(result)
the second and third elements only show up 25% of the time which is off by a lot. I tried different probability distributions (e.g., [0.1, 0.2, 0.3, 0.1, 0.1, 0.2]) and every time the result didn't match the expectation. Is there an issue with my code or is numpy really that inaccurate?
Upvotes: 0
Views: 877
Reputation: 12157
Your understanding of the np.random.choice
function is wrong. Specifically the replace=
option. The documentation suggests that replace=False
means that once an item has been chosen, it can't be chosen again. This can be shown by running
for _ in range(100):
assert set(np.random.choice(np.arange(5), 5, replace=False)) == set(range(5))
and seeing no error is ever raised. The order changes, but all 5 values must be returned.
Your current method is giving strange results because of this property. Even though 1 and 2 have a 0.3 chance of appearing as the first item, they have a less than 0.3 chance of appearing as the second or third item because if they were the first item, they can't be a later item.
The solution is obviously to use replace=True
(or ignore, True
is the default) like so:
import numpy as np
a = [0, 1, 2, 3, 4, 5]
p = [0.1, 0.3, 0.3, 0.1, 0.1, 0.1]
n = 100_000
choices = np.random.choice(a, n, p=p)
values, counts = np.unique(choices, return_counts=True)
result = dict(zip(values, counts / n))
# result == {0: 0.10063, 1: 0.30018, 2: 0.30003, 3: 0.09916, 4: 0.10109, 5: 0.09891}
Upvotes: 2