match non-unique, un-sorted array to indexes in unique, sorted array

Question

I have a sorted, unique numpy character array:

import numpy as np
vocab = np.asarray(['a', 'aaa', 'b', 'c', 'd', 'e', 'f'])

I have another, unsorted array (I actually have millions of these):

sentence = np.asarray(['b', 'aaa', 'b', 'aaa', 'b', 'z'])

This second array is much smaller than the first array, and also potentially contains values not in the original array.

What I want to do is match the values in the second array to their corresponding indexes, returning nan or some special value for non-matches.

e.g.:

sentence_idx = np.asarray([2, 1, 2, 1, 2, np.nan])

I've tried a couple of different iterations of a matching function with np.in1d, but it always seems to break down on sentences that contain repeated words.

I've also tried a couple of different list comprehensions, but they're too slow to run on my collection of millions of sentences.

So, what's the best way to accomplish this in numpy? In R, I'd use the match function, but there seems to be no numpy equivalent.

Divakar · Accepted Answer

You can use np.searchsorted, a nifty tool for such searches, like so -

# Store matching indices of 'sentence' in 'vocab' when "left-searched"
out = np.searchsorted(vocab, sentence, side='left').astype(float)

# Get matching indices of 'sentence' in 'vocab' when "right-searched".
# Now, the trick is that non-matches won't have any change between left
# and right searches. So, compare these two searches and look for the
# unchanged ones, which are the invalid ones, and set them as NaNs.
right_idx = np.searchsorted(vocab, sentence, side='right')
out[out == right_idx] = np.nan
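
To make the left/right trick concrete, here is a small sketch that prints both insertion points side by side. The `words` array and the `is_match` name are made up for illustration and are not part of the question's data.

```python
import numpy as np

vocab = np.asarray(['a', 'aaa', 'b', 'c', 'd', 'e', 'f'])
# Illustrative input: 'b' is in vocab, while 'bb' and 'z' are not.
words = np.asarray(['b', 'bb', 'z'])

# For a word that is present, 'left' points at it and 'right' points just past it.
# For a word that is absent, both searches return the same insertion point.
left = np.searchsorted(vocab, words, side='left')
right = np.searchsorted(vocab, words, side='right')

is_match = left != right
print(left)      # [2 3 7]
print(right)     # [3 3 7]
print(is_match)  # [ True False False]
```

Because `vocab` is unique, a present word always has `right == left + 1`, so the inequality is enough to separate matches from non-matches.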

Sample run -

In [17]: vocab = np.asarray(['a', 'aaa', 'b', 'c', 'd', 'e', 'f']) 
    ...: sentence = np.asarray(['b', 'aaa', 'b', 'aaa', 'b', 'z'])
    ...: 

In [18]: out = np.searchsorted(vocab,sentence,'left').astype(float)
    ...: right_idx = np.searchsorted(vocab,sentence,'right')
    ...: out[out == right_idx] = np.nan
    ...: 

In [19]: out
Out[19]: array([  2.,   1.,   2.,   1.,   2.,  nan])
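
If an integer sentinel is preferable to NaN (the question allows "some special value"), a variant along these lines might work. This is a different check from the left/right comparison above: it does a single search and then verifies the hit by looking the word up. The function name `match_sorted` and the default `missing=-1` are assumptions for illustration, and the sketch assumes `vocab` is non-empty, sorted, and unique.

```python
import numpy as np

def match_sorted(vocab, sentence, missing=-1):
    """Hypothetical helper: index of each word in a sorted, unique vocab,
    or `missing` where the word is not present."""
    # Insertion points (side='left' is the default).
    idx = np.searchsorted(vocab, sentence)
    # Words sorting after the last vocab entry get len(vocab); clip so the
    # lookup below stays in bounds.
    idx = np.minimum(idx, len(vocab) - 1)
    # A word is a real match only if the vocab entry at that position equals it.
    found = vocab[idx] == sentence
    return np.where(found, idx, missing)

vocab = np.asarray(['a', 'aaa', 'b', 'c', 'd', 'e', 'f'])
sentence = np.asarray(['b', 'aaa', 'b', 'aaa', 'b', 'z'])

result = match_sorted(vocab, sentence)
print(result)  # [ 2  1  2  1  2 -1]
```

The result keeps an integer dtype, so the matched positions can be used directly for indexing after masking out the sentinel. For many short sentences, the same call could presumably be applied once to their concatenation rather than once per sentence.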
