Filter numpy array of strings

Question

I have a very large data set collected from Twitter. I am trying to figure out how to do the equivalent of the Python filtering below in NumPy. The environment is the Python interpreter.

>>> tweets = [['buhari si good'], ['atiku is great'], ['buhari nfd sdfa atiku'],
...           ['is nice man that buhari']]
>>> filter(lambda x: 'buhari' in x[0].lower(), tweets)
[['buhari si good'], ['buhari nfd sdfa atiku'], ['is nice man that buhari']]

I tried boolean indexing like the below, but the result came back empty:

>>> tweet_arr = np.array([['buhari si good'], ['atiku is great'], ['buhari nfd sdfa atiku'], ['is nice man that buhari']])
>>> flat_tweets = tweet_arr[:, 0]
>>> flat_tweets
array(['buhari si good', 'atiku is great', 'buhari nfd sdfa atiku',
       'is nice man that buhari'], dtype='|S23')
>>> flat_tweets['buhari' in flat_tweets]
array([], shape=(0, 4), dtype='|S23')

I would like to know how to filter strings in a NumPy array the same way I was easily able to filter even numbers here:

>>> arr = np.arange(15).reshape((15, 1))
>>> arr
array([[ 0],
       [ 1],
       [ 2],
       [ 3],
       [ 4],
       [ 5],
       [ 6],
       [ 7],
       [ 8],
       [ 9],
       [10],
       [11],
       [12],
       [13],
       [14]])
>>> arr[:][arr % 2 == 0]
array([ 0,  2,  4,  6,  8, 10, 12, 14])

Thanks

fuglede · Accepted Answer

If you want to stick to a solution based entirely on NumPy, you could do

from numpy.core.defchararray import find, lower
tweet_arr[find(lower(tweet_arr), 'buhari') != -1]
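As a minimal, self-contained sketch of what that mask does, and of why the expression in the question came back empty: the example below assumes a recent NumPy under Python 3, where the same functions are available through the public `np.char` namespace. The names `contains_exact`, `mask` and `filtered` are only illustrative.

```python
import numpy as np

tweet_arr = np.array([['buhari si good'], ['atiku is great'],
                      ['buhari nfd sdfa atiku'], ['is nice man that buhari']])
flat_tweets = tweet_arr[:, 0]

# `in` on an array checks whether any element EQUALS the string and
# yields a single bool, so it cannot serve as a per-element mask.
contains_exact = 'buhari' in flat_tweets

# np.char.find returns, per element, the index of the first match or -1.
mask = np.char.find(np.char.lower(flat_tweets), 'buhari') != -1
filtered = flat_tweets[mask]

print(contains_exact)
print(mask)
print(filtered)
```

Because `mask` has one boolean per tweet, indexing with it behaves like the `arr % 2 == 0` example in the question.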

You mention in a comment that what you're looking for here is performance, so it should be noted that in the timings below this is roughly 2.5 times slower than the plain `filter` solution you came up with yourself:

In [33]: large_arr = np.repeat(tweet_arr, 10000)

In [36]: %timeit large_arr[find(lower(large_arr), 'buhari') != -1]
54.6 ms ± 765 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [43]: %timeit list(filter(lambda x: 'buhari' in x.lower(), large_arr))
21.2 ms ± 219 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In fact, an ordinary list comprehension beats both approaches:

In [44]: %timeit [x for x in large_arr if 'buhari' in x.lower()]
18.5 ms ± 102 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
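Note that the list comprehension returns a Python list rather than an array. If an array is needed afterwards, one possible sketch is to build a boolean mask from the same per-element test and index with it. The helper name `filter_contains` is hypothetical, and the code assumes a 1-D array of unicode strings (Python 3); the timings above were not measured for this variant.

```python
import numpy as np

def filter_contains(arr, needle):
    """Return the elements of a 1-D string array that contain `needle`,
    ignoring case. Hypothetical helper, for illustration only."""
    needle = needle.lower()
    # One Python-level substring test per element, collected into a bool mask.
    mask = np.fromiter((needle in s.lower() for s in arr),
                       dtype=bool, count=len(arr))
    return arr[mask]

tweets_1d = np.array(['buhari si good', 'atiku is great',
                      'buhari nfd sdfa atiku', 'is nice man that buhari'])
large_arr = np.repeat(tweets_1d, 3)
result = filter_contains(large_arr, 'Buhari')
print(result.shape)
```

The per-element work is the same as in the list comprehension, so this is mainly a convenience for staying in NumPy, not a speed-up.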
