enedene

Reputation: 3635

Vectorize comparisons

I've written a function that pulls elements from a non-uniform distribution and returns indices of elements of the input array as if they were pulled from a uniform distribution. Here is the code and an example:

import numpy as np

def uniform_choice(x, n):
    # count how many times each unique element occurs in x
    unique, counts = np.unique(x, return_counts=True)
    element_freq = np.zeros(x.shape)
    for i in range(len(unique)):
        element_freq[np.where(x == unique[i])[0]] = counts[i]
    # weight each element inversely to its frequency, normalized to sum to 1
    p = 1/element_freq/(1/element_freq).sum()
    return np.random.choice(x, n, False, p)

x = np.random.choice(["a", "b", "c", "d"], 100000, p=(0.1, 0.2, 0.3, 0.4))
#so this gives a non-uniform distribution of elements "a", "b", "c", "d"
np.unique(x, return_counts=True)

#returns
(array(['a', 'b', 'c', 'd'], dtype='<U1'), 
array([10082, 19888, 30231, 39799]))

Using my function I can pull elements from this distribution and get indices as if they were pulled from uniform distribution:

uniform_choice(x, 5000)
#returns
array([23389, 90961, 78455, ..., 41405, 22894, 79686])

Is it possible to avoid the for loop in my function? I need to do this sampling many times on very large arrays, so it becomes slow. I believe that a vectorized version of the comparison would give me much faster results.

Upvotes: 1

Views: 125

Answers (2)

Divakar

Reputation: 221634

You can kill the loopy portion, which I'm assuming is the most time-consuming part there, by extending the use of np.unique to incorporate return_inverse=True, which gives us a unique numeric label for each unique string in x. Those numeric labels can then be used as indices, leading to a vectorized calculation of element_freq. Thus, the loopy portion -

unique, counts = np.unique(x, return_counts=True)
element_freq = np.zeros(x.shape)
for i in range(len(unique)):
    element_freq[np.where(x == unique[i])[0]] = counts[i]

would be replaced by -

unique, idx, counts = np.unique(x, return_inverse=True, return_counts=True)
element_freq = counts[idx]
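
Putting that back into the rest of your function, a minimal sketch of the fully vectorized version (keeping your signature and inverse-frequency weighting; the name is just for illustration) would be -

import numpy as np

def uniform_choice_vectorized(x, n):
    # return_inverse labels every element of x with the index of its
    # unique value, so counts[idx] is each element's frequency
    unique, idx, counts = np.unique(x, return_inverse=True, return_counts=True)
    element_freq = counts[idx]
    p = 1/element_freq/(1/element_freq).sum()
    return np.random.choice(x, n, False, p)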

Runtime tests -

In [18]: x = np.random.choice(["a", "b", "c", "d"], 100000, p=(0.1, 0.2, 0.3, 0.4))

In [19]: %%timeit 
    ...: unique, counts = np.unique(x, return_counts=True)
    ...: element_freq = np.zeros(x.shape)
    ...: for i in range(len(unique)):
    ...:    element_freq[np.where(x == unique[i])[0]] = counts[i]
    ...: 
100 loops, best of 3: 18.9 ms per loop

In [20]: %%timeit 
    ...: unique, idx, counts = np.unique(x, return_inverse=True, return_counts=True)
    ...: element_freq = counts[idx]
    ...: 
100 loops, best of 3: 12.9 ms per loop
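
As a quick sanity check, you can also verify that the loopy and vectorized computations produce the same element_freq (the _loop/_vec names below are just for the comparison) -

unique, counts = np.unique(x, return_counts=True)
element_freq_loop = np.zeros(x.shape)
for i in range(len(unique)):
    element_freq_loop[np.where(x == unique[i])[0]] = counts[i]

unique, idx, counts = np.unique(x, return_inverse=True, return_counts=True)
element_freq_vec = counts[idx]

assert np.array_equal(element_freq_loop, element_freq_vec)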

Upvotes: 2

John Zwinck

Reputation: 249464

Maybe something like this (untested):

def uniform_choice(x, n):
    values = np.random.choice(x, n, False)
    sorter = np.argsort(x)
    # searchsorted finds positions in the sorted view of x; mapping them
    # back through sorter gives indices into the original, unsorted x
    return sorter[np.searchsorted(x, values, sorter=sorter)]

This generates n values from x (without replacement), then uses searchsorted to look those values up in the original array and returns their indices.

The one difference I expect from this approach is that you will only get a single index in x for each value. That is, a value which appears in x multiple times will always be represented by the index of one particular occurrence, whereas in your original code it could be any of them.
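
A tiny illustrative sketch of that caveat (using kind="stable" in argsort so the repeated value deterministically maps to its first occurrence) -

import numpy as np

x = np.array(["a", "b", "a", "c"])
sorter = np.argsort(x, kind="stable")   # sorted order visits indices 0, 2, 1, 3
values = np.array(["a", "a"])           # "a" sampled twice
# both samples of "a" map to the same original index, 0
print(sorter[np.searchsorted(x, values, sorter=sorter)])   # [0 0]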

Upvotes: 0
