Reputation: 85
I have a numpy array containing lists (of various lengths) which consist of strings. I'd like to find all rows in which a certain string appears, e.g. I want to find all numpy array indices for which the corresponding list contains the string 'hello'. I thought I could simply use
np.where('hello' in np_array)
but unfortunately this just results in an empty numpy array. Any ideas?
Upvotes: 2
Views: 1959
Reputation: 13079
Extending the answer about np.vectorize in order to answer the additional question about also returning the index within each list (asked by the OP as a comment under the accepted answer): you could define a function which returns the index number, or -1 if the item is not present, and then vectorize that. You can then post-process the return value of this vectorized function to obtain both types of required indices.
import numpy as np

# dtype=object is needed so that recent NumPy versions accept ragged lists
myarr = np.array([['foo', 'bar', 'baz'],
                  ['quux'],
                  ['baz', 'foo']], dtype=object)

def get_index(val, lst):
    "return the index in a list, or -1 if the item is not present"
    try:
        return lst.index(val)
    except ValueError:
        return -1

func = lambda x: get_index('foo', x)
list_indices = np.vectorize(func)(myarr)  # [0 -1 1]
valid = (list_indices >= 0)  # [True False True]
array_indices = np.where(valid)[0]  # [0 2]
valid_list_indices = list_indices[valid]  # [0 1]
print(np.stack([array_indices, valid_list_indices]).T)
# [[0 0]    <== list 0, element 0
#  [2 1]]   <== list 2, element 1
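Note that list.index only reports the first match in each list. If the string can occur more than once per list and you need every position, a plain comprehension is probably simpler than vectorizing. The following is only a sketch under that assumption; the helper name all_positions is made up for illustration and is not part of the approach above.

```python
import numpy as np

# Example data in which 'foo' occurs twice in the first list
myarr = np.array([['foo', 'bar', 'foo'],
                  ['quux'],
                  ['baz', 'foo']], dtype=object)

def all_positions(arr, val):
    "return an (n, 2) array of (array index, list index) pairs for every match"
    pairs = [(i, j)
             for i, lst in enumerate(arr)
             for j, item in enumerate(lst)
             if item == val]
    # reshape keeps the result two-dimensional even when nothing matches
    return np.array(pairs, dtype=int).reshape(-1, 2)

print(all_positions(myarr, 'foo'))
# [[0 0]
#  [0 2]
#  [2 1]]
```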
Upvotes: 0
Reputation: 1620
Following @aminrd's answer, you can also use np.isin instead of Python's in, which gives you the benefit of returning a boolean numpy array representing where the string hello appears.
import numpy as np
myarray = np.array(
    [["hello", "salam", "bonjour"], ["a", "b", "c"], ["hello"]], dtype=object
)
ids = np.frompyfunc(lambda x: np.isin(x, "hello"), 1, 1)(myarray)
idxs = [(i, np.where(curr)[0][0]) for i, curr in enumerate(ids) if curr.any()]
Result:
>>> print(ids)
[array([ True, False, False]) array([False, False, False]) array([ True])]
>>> print(idxs)
[(0, 0), (2, 0)]
EDIT: If you want to avoid the explicit loop, you could pad the array with 0 (same as False) and then use numpy's broadcasting normally (this is necessary since ids becomes an object array with shape (3,)).
>>> import itertools
>>> padded_ids = np.column_stack(list(itertools.zip_longest(*ids, fillvalue=0)))
>>> print(np.stack(np.where(padded_ids), axis=1))
[[0 0]
 [2 0]]
Keep in mind that padding methods usually have some kind of loop somewhere, so I don't think you can get away from it entirely.
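To make that hidden loop visible, here is a rough sketch of the same padding done by hand. The names masks, width and padded are just illustrative, and masks is a stand-in for the ids array from above rather than the exact object returned by np.frompyfunc.

```python
import numpy as np

# Ragged boolean masks, one per list (stand-in for `ids` above)
masks = [np.array([True, False, False]),
         np.array([False, False, False]),
         np.array([True])]

# Rectangular all-False array, wide enough for the longest mask
width = max(len(m) for m in masks)
padded = np.zeros((len(masks), width), dtype=bool)

# The explicit loop: copy each mask into the start of its row
for i, m in enumerate(masks):
    padded[i, :len(m)] = m

# Each row of the result is (array index, list index)
print(np.argwhere(padded))
# [[0 0]
#  [2 0]]
```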
Upvotes: 1
Reputation: 5070
import numpy as np
np_arr = np.array([['hello', 'salam', 'bonjour'], ['a', 'b', 'c'], ['hello']], dtype=object)
vec_func = np.vectorize(lambda x: 'hello' in x)
ind = vec_func(np_arr)
Output:
# ind:
array([ True, False, True])
# np_arr[ind]:
array([list(['hello', 'salam', 'bonjour']), list(['hello'])], dtype=object)
However, if you wish to get the output as a list of integers for indices, you might use:
np.where(vec_func(np_arr))
#(array([0, 2], dtype=int64),)
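If you need this for different search strings, the two steps could be wrapped in a small helper. This is only a sketch: the name rows_containing is hypothetical, and passing otypes=[bool] is my own choice so that np.vectorize does not have to infer the output type from the first element.

```python
import numpy as np

def rows_containing(arr, target):
    "return the indices of the lists in arr that contain target"
    contains = np.vectorize(lambda lst: target in lst, otypes=[bool])
    # flatnonzero gives the integer positions of the True entries
    return np.flatnonzero(contains(arr))

np_arr = np.array([['hello', 'salam', 'bonjour'], ['a', 'b', 'c'], ['hello']],
                  dtype=object)
print(rows_containing(np_arr, 'hello'))
# [0 2]
```

Keep in mind that np.vectorize is a convenience wrapper and still calls the lambda once per list, so this is about readability rather than speed.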
Upvotes: 4