Reputation: 85

Find indices in numpy arrays consisting of lists where element is in list

I have a numpy array whose elements are lists of strings, and the lists vary in length. I'd like to find every index at which the corresponding list contains a given string, e.g. all array indices whose list contains the string 'hello'. I thought I could simply use

np.where('hello' in np_array)

but unfortunately this just results in an empty numpy array. Any ideas?

Upvotes: 2

Answers (3)

alani

Reputation: 13079

This extends the np.vectorize answer to cover the follow-up question about also returning the index within each list (asked by the OP in a comment under the accepted answer). You could define a function that returns the position of the item in a list, or -1 if it is absent, and then vectorize that. Post-processing the result of the vectorized function then gives both kinds of index: the index into the array and the index within each matching list.

import numpy as np

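# dtype=object is required for ragged (unequal-length) lists on recent NumPy versions
myarr = np.array([['foo', 'bar', 'baz'],
                  ['quux'],
                  ['baz', 'foo']], dtype=object)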

def get_index(val, lst):
    "return the index in a list, or -1 if the item is not present"
    try:
        return lst.index(val)
    except ValueError:
        return -1

func = lambda x:get_index('foo', x)

list_indices = np.vectorize(func)(myarr)  # [0 -1 1]
valid = (list_indices >= 0)  # [True False True]

array_indices = np.where(valid)[0]   # [0 2]
valid_list_indices = list_indices[valid]  # [0 1]

print(np.stack([array_indices, valid_list_indices]).T)
# [[0 0]    <== list 0, element 0
#  [2 1]]   <== list 2, element 1
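
One caveat worth sketching: without `otypes`, np.vectorize infers the output type by calling the function on the first element, so it raises on an empty input array. The variant below passes `otypes=[int]` to sidestep that. The name `find_foo` is just an illustrative choice, not part of the answer above.

```python
import numpy as np

def get_index(val, lst):
    "return the index in a list, or -1 if the item is not present"
    try:
        return lst.index(val)
    except ValueError:
        return -1

# Fixing the output type up front means no probing call on the first element.
find_foo = np.vectorize(lambda lst: get_index('foo', lst), otypes=[int])

myarr = np.array([['foo', 'bar', 'baz'], ['quux'], ['baz', 'foo']], dtype=object)
list_indices = find_foo(myarr)

# An empty object array is now handled and gives an empty integer array.
empty_indices = find_foo(np.array([], dtype=object))
```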

Upvotes: 0

Camilo Martínez M.

Reputation: 1620

Following @aminrd's answer, you can also use np.isin instead of Python's in, which gives you the benefit of returning a boolean numpy array representing where the string hello appears.

import numpy as np

myarray = np.array(
    [["hello", "salam", "bonjour"], ["a", "b", "c"], ["hello"]], dtype=object
)

ids = np.frompyfunc(lambda x: np.isin(x, "hello"), 1, 1)(myarray)

idxs = [(i, np.where(curr)[0][0]) for i, curr in enumerate(ids) if curr.any()]

Result:

>>> print(ids)
    [array([ True, False, False]) array([False, False, False]) array([ True])]
>>> print(idxs)
    [(0, 0), (2, 0)]

EDIT: If you want to avoid the explicit loop, you could pad the array with 0 (same as False) and then use numpy's broadcasting normally (this is necessary since ids becomes an object array with shape (3,))

>>> import itertools
>>> padded_ids = np.column_stack(list(itertools.zip_longest(*ids, fillvalue=0)))
>>> print(np.stack(np.where(padded_ids), axis=1))
    [[0 0]
     [2 0]]

Keep in mind padding methods usually have some kind of a loop somewhere, so I don't think you can totally get away from it.
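
If a loop is unavoidable anyway, a plain Python comprehension may be the simplest option. Unlike `idxs` above, which keeps only the first match per list, this sketch collects every occurrence. The helper name `find_all` is hypothetical and chosen only for illustration.

```python
import numpy as np

myarray = np.array(
    [["hello", "salam", "bonjour"], ["a", "b", "c"], ["hello"]], dtype=object
)

def find_all(rows, target):
    """Return an (n, 2) int array of (array index, index within list) pairs."""
    pairs = [
        (i, j)
        for i, row in enumerate(rows)
        for j, item in enumerate(row)
        if item == target
    ]
    # reshape keeps a consistent (0, 2) shape when nothing matches
    return np.array(pairs, dtype=int).reshape(-1, 2)

pairs = find_all(myarray, "hello")
```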

Upvotes: 1

aminrd

Reputation: 5070

import numpy as np
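# dtype=object is required for ragged (unequal-length) lists on recent NumPy versions
np_arr = np.array([['hello', 'salam', 'bonjour'], ['a', 'b', 'c'], ['hello']], dtype=object)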

vec_func = np.vectorize(lambda x: 'hello' in x)
ind = vec_func(np_arr)

Output:

#Ind: 
array([ True, False,  True])

# np_arr[ind]:
array([list(['hello', 'salam', 'bonjour']), list(['hello'])], dtype=object)

However, if you want the matching indices as integers rather than a boolean mask, you can use:

np.where(vec_func(np_arr))

#(array([0, 2], dtype=int64),)
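
As for why the original attempt fails: Python's `in` applied to the whole array yields a single bool rather than a per-element mask, so np.where has nothing to index. A minimal sketch of that, together with a list-comprehension alternative to np.vectorize (which is itself essentially a loop), might look like this. The variable names are illustrative only.

```python
import numpy as np

np_arr = np.array([['hello', 'salam', 'bonjour'], ['a', 'b', 'c'], ['hello']],
                  dtype=object)

# `in` on the array collapses to one bool for the whole array.
scalar_result = 'hello' in np_arr

# Testing membership per list gives the boolean mask that np.where needs.
mask = np.array(['hello' in row for row in np_arr], dtype=bool)
row_indices = np.flatnonzero(mask)
```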

Upvotes: 4
