aerin

Reputation: 22634

pandas dataframe str.contains() AND operation

I have a df (pandas DataFrame) with three rows:

some_col_name
"apple is delicious"
"banana is delicious"
"apple and banana both are delicious"

The function df.col_name.str.contains("apple|banana") will catch all of the rows:

"apple is delicious",
"banana is delicious",
"apple and banana both are delicious".

How do I apply an AND operator to the str.contains() method, so that it only grabs strings that contain BOTH "apple" & "banana"?

"apple and banana both are delicious"

I'd like to grab strings that contain 10-20 different words (grape, watermelon, berry, orange, ..., etc.).

Upvotes: 55

Views: 114570

Answers (10)

Vaibhav Gupta

Reputation: 41

You can create one mask per word and combine them:

apple_mask = df.colname.str.contains('apple')
banana_mask = df.colname.str.contains('banana')
df = df[apple_mask & banana_mask]
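
For the longer lists of words in the question, a minimal sketch of the same idea (assuming a col_name column like the one in the question) that loops over the list and ANDs the masks together:

import pandas as pd

df = pd.DataFrame({'col_name': ["apple is delicious",
                                "banana is delicious",
                                "apple and banana both are delicious"]})

words = ['apple', 'banana']  # extend with grape, watermelon, berry, ...

# Start from an all-True mask and AND in one contains() mask per word.
mask = pd.Series(True, index=df.index)
for word in words:
    mask &= df['col_name'].str.contains(word)

print(df[mask])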

Upvotes: 3

Jonny

Reputation: 51

From @Anzel's answer, I wrote a function since I'm going to be applying this a lot:

def regify(words, base=r'^{}', expr=r'(?=.*{})'):
    return base.format(''.join(expr.format(w) for w in words))

So if you have words defined:

words = ['apple', 'banana']

And then call it with something like:

dg = df.loc[
    df['col_name'].str.contains(regify(words), case=False, regex=True)
]

then you should get what you're after.
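
With the words above, regify(words) produces the same lookahead pattern used in @Anzel's answer below:

>>> regify(['apple', 'banana'])
'^(?=.*apple)(?=.*banana)'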

Upvotes: 1

Sergey Zakharov

Reputation: 1603

If you only want to use native methods and avoid writing regexps, here is a vectorized version with no lambdas involved:

import numpy as np

targets = ['apple', 'banana', 'strawberry']
# Use a list (not a generator) so np.vstack accepts it on newer NumPy versions.
fruit_masks = [df['col'].str.contains(string) for string in targets]
combined_mask = np.vstack(fruit_masks).all(axis=0)
df[combined_mask]

Upvotes: 7

Charan Reddy

Reputation: 524

This works:

df.col.str.contains(r'(?=.*apple)(?=.*banana)', regex=True)

Upvotes: 13

pault

Reputation: 43504

Enumerating all possibilities for large lists is cumbersome. A better way is to use reduce() and the bitwise AND operator (&).

For example, consider the following DataFrame:

import pandas as pd

df = pd.DataFrame({'col': ["apple is delicious",
                           "banana is delicious",
                           "apple and banana both are delicious",
                           "i love apple, banana, and strawberry"]})

#                                    col
#0                    apple is delicious
#1                   banana is delicious
#2   apple and banana both are delicious
#3  i love apple, banana, and strawberry

Suppose we wanted to search for all of the following:

targets = ['apple', 'banana', 'strawberry']

We can do:

from functools import reduce  # needed on Python 3
print(df[reduce(lambda a, b: a & b, (df['col'].str.contains(s) for s in targets))])

#                                    col
#3  i love apple, banana, and strawberry
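
An equivalent variant, sketched with operator.and_ instead of the lambda:

from functools import reduce
from operator import and_

print(df[reduce(and_, (df['col'].str.contains(s) for s in targets))])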

Upvotes: 3

Siraj S.

Reputation: 3751

If you want to catch sentences that contain at least two of the target words, maybe this will work (taking the tip from @Alexander):

target=['apple','banana','grapes','orange']
connector_list=['and']
df[df.col.apply(lambda sentence: (any(word in sentence for word in target)) & (all(connector in sentence for connector in connector_list)))]

output:

                                   col
2  apple and banana both are delicious

If you have more than two words to catch that are separated by a comma ',', then add the comma to the connector_list and modify the second condition from all to any:

df[df.col.apply(lambda sentence: (any(word in sentence for word in target)) & (any(connector in sentence for connector in connector_list)))]

output:

                                        col
2        apple and banana both are delicious
3  orange,banana and apple all are delicious

Upvotes: 3

Alexander

Reputation: 109546

df = pd.DataFrame({'col': ["apple is delicious",
                           "banana is delicious",
                           "apple and banana both are delicious"]})

targets = ['apple', 'banana']

# Any word from `targets` is present in the sentence.
>>> df.col.apply(lambda sentence: any(word in sentence for word in targets))
0    True
1    True
2    True
Name: col, dtype: bool

# All words from `targets` are present in the sentence.
>>> df.col.apply(lambda sentence: all(word in sentence for word in targets))
0    False
1    False
2     True
Name: col, dtype: bool
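
To actually filter the rows, index the DataFrame with the second mask:

>>> df[df.col.apply(lambda sentence: all(word in sentence for word in targets))]
                                   col
2  apple and banana both are delicious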

Upvotes: 39

pmaniyan

Reputation: 1046

Try this regex:

apple.*banana|banana.*apple

Code is:

import pandas as pd

df = pd.DataFrame([[1,"apple is delicious"],[2,"banana is delicious"],[3,"apple and banana both are delicious"]],columns=('ID','String_Col'))

print(df[df['String_Col'].str.contains(r'apple.*banana|banana.*apple')])

Output

   ID                           String_Col
2   3  apple and banana both are delicious

Upvotes: 4

Anzel

Reputation: 20553

You can also do it in regular-expression style, using lookaheads:

df[df['col_name'].str.contains(r'^(?=.*apple)(?=.*banana)')]

You can then build your list of words into a regex string like so:

base = r'^{}'
expr = '(?=.*{})'
words = ['apple', 'banana', 'cat']  # example
base.format(''.join(expr.format(w) for w in words))

will render:

'^(?=.*apple)(?=.*banana)(?=.*cat)'

Then you can do your stuff dynamically.
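
For instance, a sketch feeding the generated pattern back into str.contains (assuming the same col_name column as above):

pattern = base.format(''.join(expr.format(w) for w in words))
df[df['col_name'].str.contains(pattern, regex=True)]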

Upvotes: 47

flyingmeatball

Reputation: 7997

You can do that as follows:

df[(df['col_name'].str.contains('apple')) & (df['col_name'].str.contains('banana'))]
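
If the match should ignore case, str.contains also accepts case=False, e.g.:

df[(df['col_name'].str.contains('apple', case=False)) & (df['col_name'].str.contains('banana', case=False))]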

Upvotes: 62
