Reputation: 3635
I've written a function that samples from an array whose values follow a non-uniform distribution and returns indices into that array, chosen so that the values at those indices look as if they were drawn from a uniform distribution. Here is the code and an example:
import numpy as np

def uniform_choice(x, n):
    unique, counts = np.unique(x, return_counts=True)
    element_freq = np.zeros(x.shape)
    for i in range(len(unique)):
        element_freq[np.where(x == unique[i])[0]] = counts[i]
    p = 1/element_freq/(1/element_freq).sum()
    return np.random.choice(x.shape[0], n, False, p)
x = np.random.choice(["a", "b", "c", "d"], 100000, p=(0.1, 0.2, 0.3, 0.4))
#so this gives a non-uniform distribution of elements "a", "b", "c", "d"
np.unique(x, return_counts=True)
#returns
(array(['a', 'b', 'c', 'd'], dtype='<U1'),
array([10082, 19888, 30231, 39799]))
Using my function I can sample from this distribution and get indices whose values look as if they were drawn from a uniform distribution:
uniform_choice(x, 5000)
#returns
array([23389, 90961, 78455, ..., 41405, 22894, 79686])
Is it possible to avoid the for loop in my function? I need to do this sampling many times on very large arrays, so it becomes slow. I believe a vectorized version of the comparison would give me much faster results.
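For concreteness, here is the whole setup end to end; note that the sampling is done over index positions (`x.shape[0]`) rather than over the values themselves, which is what makes the function return indices. Tallying the values at the returned indices shows they come out roughly uniform:

```python
import numpy as np

def uniform_choice(x, n):
    # weight each position inversely to how often its value occurs
    unique, counts = np.unique(x, return_counts=True)
    element_freq = np.zeros(x.shape)
    for i in range(len(unique)):
        element_freq[np.where(x == unique[i])[0]] = counts[i]
    p = 1 / element_freq / (1 / element_freq).sum()
    # sample index positions, not values
    return np.random.choice(x.shape[0], n, False, p)

x = np.random.choice(["a", "b", "c", "d"], 100000, p=(0.1, 0.2, 0.3, 0.4))
idx = uniform_choice(x, 5000)
# values sitting at the sampled indices: roughly 1250 occurrences each
print(np.unique(x[idx], return_counts=True))
```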
Upvotes: 1
Views: 125
Reputation: 221634
You can kill the loopy portion, which I am assuming is the most time-consuming part there, by extending the use of np.unique to incorporate return_inverse=True. That gives us unique numeric labels for each unique string in x. Those numeric labels can then be used as indices into the counts array, leading to a vectorized calculation of element_freq. Thus, the loopy portion -
unique, counts = np.unique(x, return_counts=True)
element_freq = np.zeros(x.shape)
for i in range(len(unique)):
    element_freq[np.where(x == unique[i])[0]] = counts[i]
would be replaced by -
unique, idx, counts = np.unique(x, return_inverse=True, return_counts=True)
element_freq = counts[idx]
Runtime tests -
In [18]: x = np.random.choice(["a", "b", "c", "d"], 100000, p=(0.1, 0.2, 0.3, 0.4))
In [19]: %%timeit
    ...: unique, counts = np.unique(x, return_counts=True)
    ...: element_freq = np.zeros(x.shape)
    ...: for i in range(len(unique)):
    ...:     element_freq[np.where(x == unique[i])[0]] = counts[i]
    ...:
100 loops, best of 3: 18.9 ms per loop
In [20]: %%timeit
    ...: unique, idx, counts = np.unique(x, return_inverse=True, return_counts=True)
    ...: element_freq = counts[idx]
    ...:
100 loops, best of 3: 12.9 ms per loop
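Putting the pieces together, a fully vectorized version of the function could look like this (a sketch that keeps the original's inverse-frequency weighting and index-returning behavior):

```python
import numpy as np

def uniform_choice(x, n):
    # one np.unique call yields both the inverse labels and the per-value counts
    unique, inv, counts = np.unique(x, return_inverse=True, return_counts=True)
    element_freq = counts[inv]                       # frequency of each element's value
    p = 1 / element_freq / (1 / element_freq).sum()  # inverse-frequency weights
    return np.random.choice(x.shape[0], n, False, p)

x = np.random.choice(["a", "b", "c", "d"], 100000, p=(0.1, 0.2, 0.3, 0.4))
idx = uniform_choice(x, 5000)
# values at the sampled indices come out roughly uniform (~1250 each)
print(np.unique(x[idx], return_counts=True))
```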
Upvotes: 2
Reputation: 249464
Maybe something like this (untested):
def uniform_choice(x, n):
    unique = np.unique(x)
    values = np.random.choice(unique, n)
    return np.searchsorted(x, values, sorter=np.argsort(x))
This generates n values from the unique set, then uses searchsorted to look for those values in the original array, and returns their indexes.
The one difference I expect from this approach is that you will only get the first index in x where each value appears. That is, a value which appears in x multiple times will always be represented by the index of a single one of its occurrences, whereas in your original code it could be any of them.
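One wrinkle worth noting with this untested sketch: np.searchsorted with a sorter returns positions into the *sorted* view of x, so those positions have to be mapped back through the sorter to get indices into x itself. A minimal illustration (using a stable argsort so the tie-breaking is deterministic):

```python
import numpy as np

x = np.array(["b", "a", "c", "a", "b", "d"])
values = np.array(["a", "c", "d"])

# stable sort so equal values keep their original relative order
sorter = np.argsort(x, kind="stable")
# searchsorted gives positions within x[sorter] (the sorted view)...
pos_sorted = np.searchsorted(x, values, sorter=sorter)
# ...so index the sorter to map them back to positions in x itself
idx = sorter[pos_sorted]
print(idx)     # array([1, 2, 5])
print(x[idx])  # array(['a', 'c', 'd'], dtype='<U1')
```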
Upvotes: 0