Reputation: 161
Similar to Numpy random choice to produce a 2D-array with all unique values, I am looking for an efficient way of generating:
n = 1000
k = 10
number_of_combinations = 1000000
p = np.random.rand(n)
p /= np.sum(p)
my_combinations = np.random.choice(n, size=(number_of_combinations, k), replace=False, p=p)
As discussed in the previous question, I want this matrix to have only unique rows. Unfortunately, the provided solutions do not work for the additional extension of using specific probabilities p.
My current solution is as follows:
my_combinations = set()
while len(my_combinations) < number_of_combinations:
new_combination = np.random.choice(n, size=k, replace=False, p=p)
my_combinations.add(frozenset(new_combination))
print(my_combinations)
However, I do think that there should be a more efficient numpy approach to solve this faster.
Upvotes: 2
Views: 395
Reputation: 11602
For these parameter values, the probability of encountering a duplicate row is astronomically small (unless p
is very skewed, perhaps to the extent that cannot be accommodated by float precision). I would just use
my_combinations = np.random.choice(n, size=number_of_combinations, k), replace=True, p=p)
You can check for duplicates in O(N log N)
where N = number_of_combinations
;
Conservatively, you could generate
my_combinations = np.random.choice(n, size=2 * number_of_combinations, k), replace=True, p=p)
then drop duplicates and take the first number_of_combinations
rows.
Upvotes: 2