R overflow
R overflow

Reputation: 1352

Check if Multiple Strings are present in a DataFrame Column

I would like to check if items in a List are in a Column from my DF.

The basics where straightforward:

fruit = ['apple','banana']    # This items should be in the column 
fruit = ', '.join(fruit)      # Think this is the point where it goes wrong... 

fruit_resulst = df['all_fruit'].str.contains(fruit) # Check if column contains fruit 
df_new = df[fruit_resulst]   # Filter so that we only keep the TRUEs 

This works, but not completely. It only works in this specific order, but I would like to have it working in all orders (e.g., if a column row contains ALL items from the list, then I would like to keep them. Else, remove.

df['all_fruit']

Apple, Banana             #Return! Because it contains apple and banana
Banana                    # Do not return 
Banana, Apple             #Return! Because it contains apple and banana    
Apple                     # Do not return
Apple, Banana, Peer       #Return! Because it contains apple and banana

Thanks a lot in advance!

Upvotes: 2

Views: 975

Answers (3)

jezrael
jezrael

Reputation: 862511

Convert values to lowercase, then split to lists and test issubset by convert fruit to set:

df1 = df[df.all_fruit.str.lower().str.split(', ').map(set(fruit).issubset)]
print (df1)
             all_fruit
0        Apple, Banana
2        Banana, Apple
4  Apple, Banana, Peer

Your solution with list of boolean masks passed to np.logical_and.reduce:

df1 = df[np.logical_and.reduce([df.all_fruit.str.contains(f, case=False) for f in fruit])]
print (df1)
             all_fruit
0        Apple, Banana
2        Banana, Apple
4  Apple, Banana, Peer

Upvotes: 2

U13-Forward
U13-Forward

Reputation: 71570

Try this code:

x = df['all_fruit'].str.split(',', expand=True)
print(df[x.replace('Apple', '').ne(x).any(1) & x.replace(' Banana', '').ne(x).any(1)])

Output:

             all_fruit
0        Apple, Banana
2        Banana, Apple
4  Apple, Banana, Peer

Upvotes: 1

Behzad Shayegh
Behzad Shayegh

Reputation: 333

df = pd.DataFrame({'all_fruit': [
    'Apple, Banana',
    'Banana',
    'Banana, Apple',
    'Apple',
    'Apple, Banana, Peer',
]})
fruit = ['apple','banana']
have_fruits = [df.all_fruit.str.contains(f, case=False) for f in fruit]
indexes = True
for f in have_fruits:
    indexes = indexes * f
df[indexes]

Upvotes: 1

Related Questions