Reputation: 22634
I have a df (Pandas Dataframe) with three rows:
some_col_name
"apple is delicious"
"banana is delicious"
"apple and banana both are delicious"
The function df.col_name.str.contains("apple|banana")
will catch all of the rows:
"apple is delicious",
"banana is delicious",
"apple and banana both are delicious".
How do I apply an AND operator to the str.contains()
method, so that it only grabs strings that contain BOTH "apple" & "banana"?
"apple and banana both are delicious"
I'd like to grab strings that contain 10-20 different words (grape, watermelon, berry, orange, ..., etc.)
Upvotes: 55
Views: 114570
Reputation: 41
You can create a boolean mask for each word and combine them with &:
apple_mask = df.col_name.str.contains('apple')
banana_mask = df.col_name.str.contains('banana')
df = df[apple_mask & banana_mask]
Upvotes: 3
Reputation: 51
From @Anzel's answer, I wrote a function since I'm going to be applying this a lot:
def regify(words, base=r'^{}', expr='(?=.*{})'):
    # Build one lookahead per word, anchored at the start of the string.
    return base.format(''.join(expr.format(w) for w in words))
So if you have words defined:
words = ['apple', 'banana']
And then call it with something like:
dg = df.loc[
df['col_name'].str.contains(regify(words), case=False, regex=True)
]
then you should get what you're after.
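One caveat with building the pattern from raw words: a word containing regex metacharacters (for example "c++") would be interpreted as regex syntax. A minimal sketch of a variant that escapes each word first is shown here; the name regify_escaped and the sample DataFrame are assumptions made for illustration, not part of the answer above.

```python
import re
import pandas as pd

def regify_escaped(words):
    # Hypothetical variant of regify(): re.escape() makes each word match literally.
    return '^' + ''.join('(?=.*{})'.format(re.escape(w)) for w in words)

# Illustrative data; the column name follows the question.
df = pd.DataFrame({'col_name': ["apple is delicious",
                                "Apple and banana both are delicious"]})

pattern = regify_escaped(['apple', 'banana'])
dg = df.loc[df['col_name'].str.contains(pattern, case=False, regex=True)]
print(dg)
```

With case=False, the second row is kept even though "Apple" is capitalized.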
Upvotes: 1
Reputation: 1603
If you only want to use native methods and avoid writing regexps, here is a vectorized version with no lambdas involved:
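import numpy as np
targets = ['apple', 'banana', 'strawberry']
# Build a list of masks: np.vstack expects a sequence, and recent NumPy versions reject generators.
fruit_masks = [df['col'].str.contains(string) for string in targets]
combined_mask = np.vstack(fruit_masks).all(axis=0)
df[combined_mask]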
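If the column can contain missing values, str.contains returns a missing value for those rows by default, which gets in the way of boolean indexing. A hedged sketch of the same stacking approach with na=False (missing rows count as non-matches) and regex=False (plain substring search) follows; the sample data is made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data with one missing entry.
df = pd.DataFrame({'col': ["apple and banana both are delicious",
                           None,
                           "banana is delicious"]})
targets = ['apple', 'banana']

# na=False turns missing entries into False; regex=False treats targets as literal text.
masks = [df['col'].str.contains(t, regex=False, na=False) for t in targets]
combined_mask = np.vstack(masks).all(axis=0)
result = df[combined_mask]
print(result)
```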
Upvotes: 7
Reputation: 524
This works
df.col.str.contains(r'(?=.*apple)(?=.*banana)',regex=True)
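To see why this pattern behaves like AND: each (?=.*word) is a lookahead that checks for the word somewhere ahead without consuming any text, so both checks have to succeed from the same position. A small sketch using the standard re module (the sample sentences are illustrative):

```python
import re

pattern = re.compile(r'(?=.*apple)(?=.*banana)')

sentences = ["apple is delicious",
             "banana is delicious",
             "banana and apple both are delicious"]

# Both lookaheads must succeed, in any order of appearance.
matches = [bool(pattern.search(s)) for s in sentences]
print(matches)
```

Only the last sentence matches, even though "banana" comes before "apple" in it.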
Upvotes: 13
Reputation: 43504
Enumerating all possibilities for large lists is cumbersome. A better way is to use reduce() and the bitwise AND operator (&).
For example, consider the following DataFrame:
df = pd.DataFrame({'col': ["apple is delicious",
"banana is delicious",
"apple and banana both are delicious",
"i love apple, banana, and strawberry"]})
# col
#0 apple is delicious
#1 banana is delicious
#2 apple and banana both are delicious
#3 i love apple, banana, and strawberry
Suppose we wanted to search for all of the following:
targets = ['apple', 'banana', 'strawberry']
We can do:
from functools import reduce  # reduce is not a builtin in Python 3
print(df[reduce(lambda a, b: a & b, (df['col'].str.contains(s) for s in targets))])
# col
#3 i love apple, banana, and strawberry
Upvotes: 3
Reputation: 3751
If you want to catch sentences that contain at least two of the words, maybe this will work (taking the tip from @Alexander):
target=['apple','banana','grapes','orange']
connector_list=['and']
df[df.col.apply(lambda sentence: (any(word in sentence for word in target)) & (all(connector in sentence for connector in connector_list)))]
output:
col
2 apple and banana both are delicious
If you have more than two words to catch and they are separated by a comma ',', then add the comma to connector_list and change the second condition from all to any:
df[df.col.apply(lambda sentence: (any(word in sentence for word in target)) & (any(connector in sentence for connector in connector_list)))]
output:
col
2 apple and banana both are delicious
3 orange,banana and apple all are delicious
Upvotes: 3
Reputation: 109546
df = pd.DataFrame({'col': ["apple is delicious",
"banana is delicious",
"apple and banana both are delicious"]})
targets = ['apple', 'banana']
# Any word from `targets` is present in sentence.
>>> df.col.apply(lambda sentence: any(word in sentence for word in targets))
0 True
1 True
2 True
Name: col, dtype: bool
# All words from `targets` are present in sentence.
>>> df.col.apply(lambda sentence: all(word in sentence for word in targets))
0 False
1 False
2 True
Name: col, dtype: bool
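Note that `word in sentence` is a substring test, so "apple" is also found inside "pineapple". If whole words are wanted, one possible sketch is to split each sentence on whitespace and test set membership. The sample data is illustrative, and this simple split does not strip punctuation, so "apple," would not count as "apple".

```python
import pandas as pd

df = pd.DataFrame({'col': ["pineapple and banana are delicious",
                           "apple and banana both are delicious"]})
targets = {'apple', 'banana'}

# Substring test: "apple" is found inside "pineapple".
substring_mask = df['col'].apply(lambda s: all(w in s for w in targets))

# Whole-word test: every target must appear as a whitespace-separated token.
word_mask = df['col'].apply(lambda s: targets.issubset(s.split()))

print(df[word_mask])
```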
Upvotes: 39
Reputation: 1046
Try this regex
apple.*banana|banana.*apple
Code is:
import pandas as pd
df = pd.DataFrame([[1,"apple is delicious"],[2,"banana is delicious"],[3,"apple and banana both are delicious"]],columns=('ID','String_Col'))
print(df[df['String_Col'].str.contains(r'apple.*banana|banana.*apple')])
Output
ID String_Col
2 3 apple and banana both are delicious
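This alternation has to list every ordering of the words, so it grows quickly: two words need 2 alternatives, three need 6, four need 24. A rough sketch that generates the pattern with itertools.permutations makes the growth visible; the helper name any_order_pattern and the extra word "cherry" are assumptions for illustration.

```python
import re
from itertools import permutations

def any_order_pattern(words):
    # Hypothetical helper: one "w1.*w2.*..." alternative per ordering of the words.
    return '|'.join('.*'.join(re.escape(w) for w in order)
                    for order in permutations(words))

two = any_order_pattern(['apple', 'banana'])
three = any_order_pattern(['apple', 'banana', 'cherry'])

print(two)
print(len(three.split('|')))  # number of alternatives for three words
```

For the 10-20 words mentioned in the question, this number of alternatives is impractical, which is where the lookahead or mask-based answers fit better.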
Upvotes: 4
Reputation: 20553
You can also do it with a regular expression:
df[df['col_name'].str.contains(r'^(?=.*apple)(?=.*banana)')]
You can then build your list of words into a regex string like so:
base = r'^{}'
expr = '(?=.*{})'
words = ['apple', 'banana', 'cat'] # example
base.format(''.join(expr.format(w) for w in words))
will render:
'^(?=.*apple)(?=.*banana)(?=.*cat)'
Then you can do your stuff dynamically.
Upvotes: 47
Reputation: 7997
You can do that as follows:
df[(df['col_name'].str.contains('apple')) & (df['col_name'].str.contains('banana'))]
Upvotes: 62