Zach

Reputation: 30311

Match a non-unique, unsorted array to indexes in a unique, sorted array

I have a sorted, unique numpy character array:

import numpy as np
vocab = np.asarray(['a', 'aaa', 'b', 'c', 'd', 'e', 'f']) 

I have another, unsorted array (I actually have millions of these):

sentence = np.asarray(['b', 'aaa', 'b', 'aaa', 'b', 'z']) 

This second array is much smaller than the first and may contain values that are not in the first array.

What I want to do is map each value in the second array to its index in the first array, returning NaN or some other special value for non-matches.

e.g.:

sentence_idx = np.asarray([2, 1, 2, 1, 2, np.nan]) 

I've tried a couple different iterations of a matching function with np.in1d, but it always seems to break down on sentences that contain repeated words.

I've also tried a couple of different list comprehensions, but they're too slow to run on my collection of millions of sentences.

So, what's the best way to accomplish this in numpy? In R, I'd use the match function, but there seems to be no numpy equivalent.

Upvotes: 2

Views: 163

Answers (1)

Divakar

Reputation: 221584

You can use np.searchsorted, a handy tool for this kind of lookup, like so:

# Leftmost insertion points of 'sentence' in 'vocab' (cast to float so NaN can be stored)
out = np.searchsorted(vocab, sentence, side='left').astype(float)

# Rightmost insertion points of 'sentence' in 'vocab'.
# The trick: for a value that is present in 'vocab', the right search lands
# one past the left search. For a value that is absent, both searches return
# the same position. So mark every unchanged position as a non-match (NaN).
right_idx = np.searchsorted(vocab, sentence, side='right')
out[out == right_idx] = np.nan

Sample run -

In [17]: vocab = np.asarray(['a', 'aaa', 'b', 'c', 'd', 'e', 'f']) 
    ...: sentence = np.asarray(['b', 'aaa', 'b', 'aaa', 'b', 'z'])
    ...: 

In [18]: out = np.searchsorted(vocab,sentence,'left').astype(float)
    ...: right_idx = np.searchsorted(vocab,sentence,'right')
    ...: out[out == right_idx] = np.nan
    ...: 

In [19]: out
Out[19]: array([  2.,   1.,   2.,   1.,   2.,  nan])
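If an integer sentinel suits you better than NaN (the question allows "some special value"), the same left/right comparison can be wrapped in a small helper. This is only a sketch: the name match_indices and the missing=-1 default are my own choices for illustration, not part of NumPy, and it assumes vocab is already sorted and unique as stated in the question.

```python
import numpy as np

def match_indices(vocab, sentence, missing=-1):
    """Hypothetical helper: index of each sentence item in a sorted, unique vocab.

    Items that do not occur in vocab get the `missing` sentinel (assumed -1 here).
    """
    vocab = np.asarray(vocab)
    sentence = np.asarray(sentence)
    # Leftmost and rightmost insertion points for every item.
    left = np.searchsorted(vocab, sentence, side="left")
    right = np.searchsorted(vocab, sentence, side="right")
    # A present item has right == left + 1; an absent one has right == left.
    return np.where(left < right, left, missing)

vocab = np.asarray(['a', 'aaa', 'b', 'c', 'd', 'e', 'f'])
sentence = np.asarray(['b', 'aaa', 'b', 'aaa', 'b', 'z'])
result = match_indices(vocab, sentence)
print(result)
```

Keeping the result as an integer array means it can be used directly for indexing after filtering out the sentinel, whereas the NaN version has to stay float.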

Upvotes: 3
