Reputation: 3635
I've written a function that samples from an array whose values follow a non-uniform distribution and returns indices into that array, chosen so that the values at those indices look as if they were drawn from a uniform distribution. Here is the code and an example:
import numpy as np

def uniform_choice(x, n):
    unique, counts = np.unique(x, return_counts=True)
    element_freq = np.zeros(x.shape)
    for i in range(len(unique)):
        element_freq[np.where(x == unique[i])[0]] = counts[i]
    p = 1/element_freq/(1/element_freq).sum()
    return np.random.choice(x.shape[0], n, False, p)
x = np.random.choice(["a", "b", "c", "d"], 100000, p=(0.1, 0.2, 0.3, 0.4))
#so this gives a non-uniform distribution of elements "a", "b", "c", "d"
np.unique(x, return_counts=True)
#returns
(array(['a', 'b', 'c', 'd'], dtype='<U1'),
array([10082, 19888, 30231, 39799]))
Using my function I can sample from this distribution and get indices whose values look as if they were drawn from a uniform distribution:
uniform_choice(x, 5000)
#returns
array([23389, 90961, 78455, ..., 41405, 22894, 79686])
Is it possible to avoid the for loop in my function? I need to do this sampling many times on very large arrays, so it becomes slow. I believe a vectorized version of the comparison would give me much faster results.
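For concreteness, here is the whole setup end to end; note that the sampling is done over index positions (`x.shape[0]`) rather than over the values themselves, which is what makes the function return indices. Tallying the values at the returned indices shows they come out roughly uniform:

```python
import numpy as np

def uniform_choice(x, n):
    # weight each position inversely to how often its value occurs
    unique, counts = np.unique(x, return_counts=True)
    element_freq = np.zeros(x.shape)
    for i in range(len(unique)):
        element_freq[np.where(x == unique[i])[0]] = counts[i]
    p = 1 / element_freq / (1 / element_freq).sum()
    # sample index positions, not values
    return np.random.choice(x.shape[0], n, False, p)

x = np.random.choice(["a", "b", "c", "d"], 100000, p=(0.1, 0.2, 0.3, 0.4))
idx = uniform_choice(x, 5000)
# values sitting at the sampled indices: roughly 1250 occurrences each
print(np.unique(x[idx], return_counts=True))
```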
Upvotes: 1
Views: 125
Reputation: 221634
You can kill the loopy portion, which I am assuming is the most time-consuming part there, by extending the use of np.unique to incorporate return_inverse=True. That gives us unique numeric labels for each unique string in x. Those numeric labels can then be used as indices into the counts array, leading to a vectorized calculation of element_freq. Thus, the loopy portion -
unique, counts = np.unique(x, return_counts=True)
element_freq = np.zeros(x.shape)
for i in range(len(unique)):
    element_freq[np.where(x == unique[i])[0]] = counts[i]
would be replaced by -
unique, idx, counts = np.unique(x, return_inverse=True, return_counts=True)
element_freq = counts[idx]
Runtime tests -
In [18]: x = np.random.choice(["a", "b", "c", "d"], 100000, p=(0.1, 0.2, 0.3, 0.4))
In [19]: %%timeit
    ...: unique, counts = np.unique(x, return_counts=True)
    ...: element_freq = np.zeros(x.shape)
    ...: for i in range(len(unique)):
    ...:     element_freq[np.where(x == unique[i])[0]] = counts[i]
    ...:
100 loops, best of 3: 18.9 ms per loop
In [20]: %%timeit
    ...: unique, idx, counts = np.unique(x, return_inverse=True, return_counts=True)
    ...: element_freq = counts[idx]
    ...:
100 loops, best of 3: 12.9 ms per loop
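Putting the pieces together, a fully vectorized version of the function could look like this (a sketch that keeps the original's inverse-frequency weighting and index-returning behavior):

```python
import numpy as np

def uniform_choice(x, n):
    # one np.unique call yields both the inverse labels and the per-value counts
    unique, inv, counts = np.unique(x, return_inverse=True, return_counts=True)
    element_freq = counts[inv]                       # frequency of each element's value
    p = 1 / element_freq / (1 / element_freq).sum()  # inverse-frequency weights
    return np.random.choice(x.shape[0], n, False, p)

x = np.random.choice(["a", "b", "c", "d"], 100000, p=(0.1, 0.2, 0.3, 0.4))
idx = uniform_choice(x, 5000)
# values at the sampled indices come out roughly uniform (~1250 each)
print(np.unique(x[idx], return_counts=True))
```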
Upvotes: 2
Reputation: 249464
Maybe something like this (untested):
def uniform_choice(x, n):
    unique = np.unique(x)
    values = np.random.choice(unique, n)
    return np.searchsorted(x, values, sorter=np.argsort(x))
This generates n values from the unique set, then uses searchsorted to look for those values in the original array, and returns their indexes.
The one difference I expect from this approach is that you will only get the first index in x where each value appears. That is, a value which appears in x multiple times will always be represented by the index of a single one of its occurrences, whereas in your original code it could be any of them.
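One wrinkle worth noting with this untested sketch: np.searchsorted with a sorter returns positions into the *sorted* view of x, so those positions have to be mapped back through the sorter to get indices into x itself. A minimal illustration (using a stable argsort so the tie-breaking is deterministic):

```python
import numpy as np

x = np.array(["b", "a", "c", "a", "b", "d"])
values = np.array(["a", "c", "d"])

# stable sort so equal values keep their original relative order
sorter = np.argsort(x, kind="stable")
# searchsorted gives positions within x[sorter] (the sorted view)...
pos_sorted = np.searchsorted(x, values, sorter=sorter)
# ...so index the sorter to map them back to positions in x itself
idx = sorter[pos_sorted]
print(idx)     # array([1, 2, 5])
print(x[idx])  # array(['a', 'c', 'd'], dtype='<U1')
```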
Upvotes: 0