jason

Reputation: 4439

Pandas filter by more than one "contains" for not one cell but entire column

I have a bunch of dataframes, and I want to find the ones where a given column contains every word I specify somewhere in that column, not necessarily in the same cell. For example, I want to find all dataframes that contain the words hello and world. A & B would qualify, C would not.

I've tried df[(df[column].str.contains('hello')) & (df[column].str.contains('world'))], which only picks up B because it requires both words in the same cell, and df[(df[column].str.contains('hello')) | (df[column].str.contains('world'))], which picks up all three because it matches any row with either word.

I need something that picks only A & B.

A =

    Name    Data   
0   Mike    hello    
1   Mike    world    
2   Mike    hello   
3   Fred    world
4   Fred    hello
5   Ted     world

B =

    Name    Data   
0   Mike    helloworld
1   Mike    world    
2   Mike    hello   
3   Fred    world
4   Fred    hello
5   Ted     world

C =

    Name    Data   
0   Mike    hello
1   Mike    hello    
2   Mike    hello   
3   Fred    hello
4   Fred    hello
5   Ted     hello

Upvotes: 4

Views: 123

Answers (3)

ALollz

Reputation: 59549

You want a single bool per DataFrame: True if 'hello' is found anywhere in the column and 'world' is also found anywhere in that same column, possibly in different rows:

df.Data.str.contains('hello').any() & df.Data.str.contains('world').any()
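To tie this back to the question, here is a minimal sketch that applies the same check to the three frames A, B and C. The helper name has_all_words and the frames dict are hypothetical, introduced only for illustration; regex=False is my addition so the words are treated as plain substrings.

```python
import pandas as pd

def has_all_words(df, column, words):
    """True if every word occurs somewhere in df[column] (substring match)."""
    col = df[column].astype(str)
    # One .any() per word: the words may sit in different rows.
    return all(col.str.contains(w, regex=False).any() for w in words)

# The sample frames from the question.
names = ["Mike", "Mike", "Mike", "Fred", "Fred", "Ted"]
A = pd.DataFrame({"Name": names,
                  "Data": ["hello", "world", "hello", "world", "hello", "world"]})
B = pd.DataFrame({"Name": names,
                  "Data": ["helloworld", "world", "hello", "world", "hello", "world"]})
C = pd.DataFrame({"Name": names, "Data": ["hello"] * 6})

frames = {"A": A, "B": B, "C": C}  # hypothetical container for the frames
matching = [name for name, df in frames.items()
            if has_all_words(df, "Data", ["hello", "world"])]
print(matching)  # ['A', 'B']
```

The difference from the attempts in the question is that .any() is applied to each word's mask before the two results are combined, so the condition is evaluated per column rather than per row.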

If you have a list of words and need to check over the entire DataFrame (assuming every cell is a string), try:

import numpy as np

lst = ['hello', 'world']
np.logical_and.reduce([any(word in x for x in df.values.ravel()) for word in lst])

Sample Data

print(df)
   Name   Data   Data2
0  Mike  hello  orange
1  Mike  world  banana
2  Mike  hello  banana
3  Fred  world  apples
4  Fred  hello   mango
5   Ted  world    pear

lst = ['apple', 'hello', 'world']
np.logical_and.reduce([any(word in x for x in df.values.ravel()) for word in lst])
# True

lst = ['apple', 'hello', 'world', 'bear']
np.logical_and.reduce([any(word in x for x in df.values.ravel()) for word in lst])
# False
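Note that word in x raises a TypeError when a cell is not a string (for example an integer column). A possible variant, assuming it is acceptable to compare against the string form of every cell, is sketched below; frame_contains_all and the Count column are hypothetical names used only for this example.

```python
import pandas as pd

def frame_contains_all(df, words):
    """True if every word is a substring of at least one cell in df."""
    # Cast to str first so numeric or missing cells do not break `in`.
    cells = df.astype(str).to_numpy().ravel()
    return all(any(word in cell for cell in cells) for word in words)

df = pd.DataFrame({"Name": ["Mike", "Fred"],
                   "Data": ["hello", "world"],
                   "Count": [1, 2]})  # non-string column on purpose

result_both = frame_contains_all(df, ["hello", "world"])
result_missing = frame_contains_all(df, ["hello", "bear"])
print(result_both, result_missing)  # True False
```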

Upvotes: 5

Vaishali

Reputation: 38415

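If hello and world are standalone strings in your data (each cell is exactly 'hello' or 'world'), df.eq() or the equivalent == should do the job and you don't need str.contains. It's not a string method, and it works on the entire DataFrame. Note that this is True only if a single column holds both values.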

(((df == 'hello').any()) & ((df == 'world').any())).any()

True
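The exact-match caveat matters for data like frame B in the question. A small sketch of the difference, using a made-up two-row frame where 'hello' only appears inside 'helloworld':

```python
import pandas as pd

# Hypothetical frame: 'hello' occurs only as part of a longer string.
df = pd.DataFrame({"Name": ["Mike", "Fred"],
                   "Data": ["helloworld", "world"]})

# Exact equality: no cell equals 'hello', so the check fails.
exact = bool(((df == "hello").any() & (df == "world").any()).any())

# Substring match on the Data column: both words are found.
substring = bool(df["Data"].str.contains("hello").any()
                 and df["Data"].str.contains("world").any())

print(exact, substring)  # False True
```

So == is a good fit when cells hold whole words, and str.contains when the words can be embedded in longer text.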

Upvotes: 1

BENY

Reputation: 323306

Using re with two lookaheads on the concatenation of every cell (this assumes all cells are strings):

import re 

bool(re.search(r'^(?=.*hello)(?=.*world)', df.sum().sum()))
Out[461]: True
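One thing to be aware of: df.sum().sum() glues the cells together with no separator, so a word could in principle match across a cell boundary, and . does not match newlines by default. A possible variant, joining with a space and passing re.DOTALL (both are my assumptions, not part of the one-liner above):

```python
import re
import pandas as pd

df = pd.DataFrame({"Name": ["Mike", "Fred"],
                   "Data": ["hello", "world"]})

# Join all cells with a space so words cannot span two cells.
text = " ".join(df.astype(str).to_numpy().ravel())

# Each lookahead asserts that one word occurs somewhere in the text.
pattern = re.compile(r"^(?=.*hello)(?=.*world)", re.DOTALL)
found = bool(pattern.search(text))
print(found)  # True
```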

Upvotes: 2
