Felix
Felix

Reputation: 161

Numpy random choice with probabilities to produce a 2D-array with unique rows

Similar to Numpy random choice to produce a 2D-array with all unique values, I am looking for an efficient way of generating:

n = 1000
k = 10
number_of_combinations = 1000000

p = np.random.rand(n)
p /= np.sum(p)

my_combinations = np.random.choice(n, size=(number_of_combinations, k), replace=False, p=p)

As discussed in the previous question, I want this matrix to have only unique rows. Unfortunately, the provided solutions do not work for the additional extension of using specific probabilities p.

My current solution is as follows:

my_combinations = set()

while len(my_combinations) < number_of_combinations:
    new_combination = np.random.choice(n, size=k, replace=False, p=p)
    my_combinations.add(frozenset(new_combination))

print(my_combinations)

However, I do think that there should be a more efficient numpy approach to solve this faster.

Upvotes: 2

Views: 395

Answers (1)

hilberts_drinking_problem
hilberts_drinking_problem

Reputation: 11602

For these parameter values, the probability of encountering a duplicate row is astronomically small (unless p is very skewed, perhaps to the extent that cannot be accommodated by float precision). I would just use

my_combinations = np.random.choice(n, size=number_of_combinations, k), replace=True, p=p)

You can check for duplicates in O(N log N) where N = number_of_combinations;

Conservatively, you could generate

my_combinations = np.random.choice(n, size=2 * number_of_combinations, k), replace=True, p=p)

then drop duplicates and take the first number_of_combinations rows.

Upvotes: 2

Related Questions