Siddharth Satpathy

Reputation: 3043

Compare numpy arrays whose entries are strings, and find locations of strings

I have two numpy arrays whose entries are strings. The first array (array1) has shape (m, n), where m > 1 and n > 1. The second array (array2) has shape (p,), where p is an integer greater than 1. The entries of array2 are unique, while array1 is likely to contain the same string multiple times.

I want to replace array1 with an array of the same shape in which each string is replaced by an index, namely the position of that string in array2. Every entry of array1 is guaranteed to match some entry of array2.

Speed is important here, and I want to find the fastest way of doing this.

Here is a small example:

import numpy as np

array1 = np.asarray([['aa', 'cc', 'bb', 'aa', 'aa', 'bb'],
                   ['cc', 'bb', 'cc', 'bb', 'aa', 'aa'],
                   ['bb', 'cc', 'aa', 'aa', 'bb', 'cc']])

array2 = np.asarray(['aa', 'bb', 'cc'])

This is how I am approaching the problem for now:

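# For each row, find the position in array2 of every entry of array1 (Python-level scan).
# array1 has a string dtype, so the assigned indices end up stored as strings.
for k in range(array1.shape[0]):
    array1[k] = np.asarray([j for i in range(array1.shape[1]) for j in range(len(array2)) if array1[k, i] == array2[j]])

print(array1)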

[['0' '2' '1' '0' '0' '1']
 ['2' '1' '2' '1' '0' '0']
 ['1' '2' '0' '0' '1' '2']]

However, when array1 has a huge number of rows and columns, this approach is not very fast.

What would be a faster way to achieve this?

Upvotes: 0

Views: 47

Answers (2)

Dani Mesejo

Reputation: 61910

A possible alternative:

import numpy as np

array1 = np.asarray([['aa', 'cc', 'bb', 'aa', 'aa', 'bb'],
                     ['cc', 'bb', 'cc', 'bb', 'aa', 'aa'],
                     ['bb', 'cc', 'aa', 'aa', 'bb', 'cc']])

array2 = np.asarray(['aa', 'bb', 'cc'])

d = {v: k for k, v in enumerate(array2)}
result = np.vectorize(d.get)(array1)

print(result)

Output

[[0 2 1 0 0 1]
 [2 1 2 1 0 0]
 [1 2 0 0 1 2]]
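
A possible variation on the above, offered as a sketch rather than as part of the original answer: d.get returns None for a string that is missing from the dictionary, so swapping in d.__getitem__ makes an unexpected entry raise a KeyError instead. Passing otypes fixes the output dtype up front. The name lookup is only illustrative. Note that the NumPy documentation describes np.vectorize as a convenience that is essentially a for loop, so this is not expected to be the fastest option on large arrays.

```python
import numpy as np

array1 = np.asarray([['aa', 'cc', 'bb', 'aa', 'aa', 'bb'],
                     ['cc', 'bb', 'cc', 'bb', 'aa', 'aa'],
                     ['bb', 'cc', 'aa', 'aa', 'bb', 'cc']])

array2 = np.asarray(['aa', 'bb', 'cc'])

# Map each string in array2 to its position; tolist() gives plain Python str keys.
d = {v: k for k, v in enumerate(array2.tolist())}

# d.__getitem__ raises KeyError for an unknown string (d.get would return None).
# otypes pins the result to an integer dtype.
lookup = np.vectorize(d.__getitem__, otypes=[np.intp])
result = lookup(array1)

print(result)
```

Whether the strict behaviour is wanted depends on how much the guarantee that every entry of array1 appears in array2 can be trusted.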

Upvotes: 1

Divakar

Reputation: 221524

Since every entry of array1 is present in array2, we can use np.searchsorted:

sidx = array2.argsort()
out = sidx[np.searchsorted(array2,array1.ravel(),sorter=sidx).reshape(array1.shape)]

If array2 is already sorted, we can skip the argsort and the corresponding indexing step:

out = np.searchsorted(array2,array1.ravel()).reshape(array1.shape)
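
For reference, here is a self-contained sketch of the unsorted case. It reuses the sample data from the question, except that array2 is deliberately reordered (an assumption made here for illustration) so that the sorter path actually matters. The final check relies on the fact that indexing array2 with the result should reproduce array1; it would fail if array1 held a string that is absent from array2, in which case np.searchsorted only returns an insertion point.

```python
import numpy as np

array1 = np.asarray([['aa', 'cc', 'bb', 'aa', 'aa', 'bb'],
                     ['cc', 'bb', 'cc', 'bb', 'aa', 'aa'],
                     ['bb', 'cc', 'aa', 'aa', 'bb', 'cc']])

# Deliberately unsorted lookup table (differs from the question's array2).
array2 = np.asarray(['cc', 'aa', 'bb'])

# Indices that would sort array2.
sidx = array2.argsort()

# Position of each entry of array1 within the sorted view of array2.
pos = np.searchsorted(array2, array1.ravel(), sorter=sidx)

# Map positions in the sorted view back to positions in the original array2.
out = sidx[pos].reshape(array1.shape)

# Round-trip check: looking the indices up in array2 must give back array1.
assert np.array_equal(array2[out], array1)

print(out)
```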

Upvotes: 3
