Reputation: 351
My question is kind of an extension of the question answered quite well in this link:
I've reproduced the answer below, where only the rows whose strings contain the word "ball" are kept:
In [3]: df[df['ids'].str.contains("ball")]
Out[3]:
ids vals
0 aball 1
1 bball 2
3 fball 4
Now my question is: what if I have long sentences in my data and want to identify the strings that contain both "ball" AND "field"? Rows where only one of the two words occurs should be thrown away, and only rows whose string contains both words should be kept.
Upvotes: 6
Views: 7380
Reputation: 323316
If you have more than two words, you can use this (note that it is slower than foxyblue's method):
l = ['ball', 'field']
df.ids.apply(lambda x: all(y in x for y in l))
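The apply call above returns a boolean Series, so it still has to be used as a mask to get the filtered rows. A minimal end-to-end sketch, assuming a small made-up DataFrame (the sample rows and the names words, mask and filtered are only for illustration):

```python
import pandas as pd

# Hypothetical sample data; the column name 'ids' follows the question.
df = pd.DataFrame(
    {"ids": ["ball and field", "ball, just ball", "field alone", "field and ball"]}
)

words = ["ball", "field"]

# True only for rows whose string contains every word in the list
# (plain substring test, no regex involved).
mask = df["ids"].apply(lambda s: all(w in s for w in words))

filtered = df[mask]
print(filtered)
```

Adding more words to the list does not require any change to the filtering line.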
Upvotes: 2
Reputation: 210882
Yet another RegEx approach:
In [409]: df
Out[409]:
ids
0 ball and field
1 ball, just ball
2 field alone
3 field and ball
In [410]: pat = r'(?:ball.*field|field.*ball)'
In [411]: df[df['ids'].str.contains(pat)]
Out[411]:
ids
0 ball and field
3 field and ball
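Writing out every ordering by hand gets unwieldy with more than two words. One possible variation, not part of the answer above, is to build one lookahead per word so the order does not matter. A sketch assuming the same sample rows (the names words and pat_all are illustrative):

```python
import re

import pandas as pd

# Same sample rows as in the answer above.
df = pd.DataFrame(
    {"ids": ["ball and field", "ball, just ball", "field alone", "field and ball"]}
)

words = ["ball", "field"]

# One lookahead per word: every word must occur somewhere in the string.
# re.escape guards against words that contain regex metacharacters.
pat_all = "".join(f"(?=.*{re.escape(w)})" for w in words)

filtered = df[df["ids"].str.contains(pat_all)]
print(filtered)
```

Note that `.` does not match newlines by default, so this sketch assumes single-line strings.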
Upvotes: 0
Reputation: 76947
You could use np.logical_and.reduce with str.contains, which handles any number of words:
df[np.logical_and.reduce([df['ids'].str.contains(w) for w in ['ball', 'field']])]
In [96]: df
Out[96]:
ids
0 ball is field
1 ball is wa
2 doll is field
In [97]: df[np.logical_and.reduce([df['ids'].str.contains(w) for w in ['ball', 'field']])]
Out[97]:
ids
0 ball is field
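If the column may contain missing values, str.contains returns NaN for those rows and the mask can no longer be used directly. A hedged sketch of the same idea that passes na=False (treat missing values as "no match") and regex=False (treat the words as literal substrings); the sample data with a None entry is made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value.
df = pd.DataFrame({"ids": ["ball is field", "ball is wa", None, "doll is field"]})

words = ["ball", "field"]

# One boolean mask per word; missing values count as False.
masks = [df["ids"].str.contains(w, regex=False, na=False) for w in words]

# Element-wise AND across all masks.
mask = np.logical_and.reduce(masks)

filtered = df[mask]
print(filtered)
```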
Upvotes: 0
Reputation: 3049
df[df['ids'].str.contains("ball")]
Would become:
df[df['ids'].str.contains("ball") & df['ids'].str.contains("field")]
If you prefer neater code:
contains_balls = df['ids'].str.contains("ball")
contains_fields = df['ids'].str.contains("field")
filtered_df = df[contains_balls & contains_fields]
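Keep in mind that str.contains does substring matching, so "ball" also matches "football". If whole words are what you are after, a possible variation (an assumption about the intent, not something the question states) is to use word boundaries and ignore case. The sample rows below are made up:

```python
import pandas as pd

# Hypothetical sentences for illustration.
df = pd.DataFrame(
    {
        "ids": [
            "the ball left the field",
            "football field",
            "ball only",
            "Field and Ball",
        ]
    }
)

# \b marks a word boundary, so "football" does not count as "ball".
# case=False makes the match case-insensitive.
contains_ball = df["ids"].str.contains(r"\bball\b", case=False)
contains_field = df["ids"].str.contains(r"\bfield\b", case=False)

filtered_df = df[contains_ball & contains_field]
print(filtered_df)
```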
Upvotes: 5