Reputation: 572
I am using numpy.random.choice to generate many random samples at once, but I'd like all the samples to be different. Is anybody aware of a function that does this? Explicitly, I'd like the following to hold:
import numpy as np
a = np.random.choice(62, size=(1000000, 8))
assert( len(set([tuple(a[i]) for i in range(a.shape[0])])) == a.shape[0])
The integer values can be replaced. The only requirement is that all rows are different.
Upvotes: 0
Views: 1682
Reputation: 9796
First things first: if you have NumPy >= 1.17, prefer the recommended Generator method over np.random.choice:
rng = np.random.default_rng()
rng.choice
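As a minimal sketch of what that looks like in practice (the seed and the variable names are illustrative, not taken from the question), a Generator can produce the same kind of array as np.random.choice(62, size=(n, 8)):

```python
import numpy as np

# The seed is arbitrary; it is only here to make the sketch reproducible.
rng = np.random.default_rng(12345)

# Same shape and value range as np.random.choice(62, size=(1000, 8)).
a = rng.choice(62, size=(1000, 8))

# rng.integers draws from the half-open range [0, 62), which is equivalent here.
b = rng.integers(0, 62, size=(1000, 8))
```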
Each sample has 8 values, so for max_value = 62 there are 62**8 possible samples. Drawing just 1 million of them means that, by the birthday problem, about 99.8% of the time they will all be unique in a single draw. In this case it suffices to generate the whole array and do a simple check.
samples = 1000000
while True:
    a = np.random.choice(62, size=(samples, 8))
    # Credit to Mark Dickinson, this is faster than doing
    # `len(set(tuple(row) for row in a)) == samples`
    if np.unique(a, axis=0).shape[0] == samples:
        break
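The 99.8% figure quoted above can be checked with the usual birthday-problem approximation, P(all unique) ≈ exp(-n(n-1) / (2N)), where n is the number of rows drawn and N = max_value**8 is the number of possible rows. This is a rough sketch; the helper name prob_all_unique is made up for illustration.

```python
import math

def prob_all_unique(n_samples, max_value=62, width=8):
    """Approximate probability that n_samples random rows are all distinct."""
    total = max_value ** width  # number of possible rows
    # Standard birthday-problem approximation: exp(-n(n-1) / (2N))
    return math.exp(-n_samples * (n_samples - 1) / (2 * total))

print(prob_all_unique(1_000_000))                # about 0.998 for max_value = 62
print(prob_all_unique(1_000_000, max_value=30))  # below 0.5
```

With max_value = 30 the chance of a duplicate-free draw is already under one half, so the retry loop starts to waste a lot of work.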
For lower values of max_value (less than 30), duplicates become frequent enough that the above approach is inefficient and may never terminate. It is then better to generate the whole array, keep the unique samples in a set, and generate however many more you still need. Repeat this process until you have as many as you need.
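seen = set()
a = []
while len(a) < samples:
    draws = np.random.choice(62, size=(samples - len(a), 8))
    for draw in draws:
        # Parentheses are required: without them `t` is bound to the boolean
        # result of `tuple(draw) not in seen` rather than to the tuple.
        if (t := tuple(draw)) not in seen:
            seen.add(t)
            a.append(draw)
a = np.array(a)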
This assumes the number of samples you want to draw is much smaller than the total number of unique samples. If, for example, there were only 1001 possible samples and you wanted to draw 1000, this approach would quickly become inefficient.
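For that near-exhaustive case, one possible alternative (not part of the approach above, and only a sketch) is to number every possible row from 0 to max_value**width - 1, draw distinct numbers without replacement, and decode each number back into its digits. The function name unique_rows and its parameters are hypothetical.

```python
import numpy as np

def unique_rows(n_samples, max_value=62, width=8, rng=None):
    """Draw n_samples distinct rows by sampling row indices without replacement."""
    rng = np.random.default_rng() if rng is None else rng
    total = max_value ** width  # number of possible rows
    if n_samples > total:
        raise ValueError("cannot draw more unique rows than exist")
    # Distinct integer codes, one per row.
    codes = rng.choice(total, size=n_samples, replace=False)
    # Decode each code into `width` digits in base `max_value`.
    digits = np.unravel_index(codes, (max_value,) * width)
    return np.stack(digits, axis=1)

# Small demonstration: 60 of the 64 possible rows with values in [0, 4).
a = unique_rows(60, max_value=4, width=3, rng=np.random.default_rng(0))
```

Because uniqueness is guaranteed by construction, no retry loop is needed, whatever the ratio of requested to possible samples.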
Upvotes: 5