PagMax

Reputation: 8568

'and' operator in string.contains

I have a pandas Series on which I am applying a string search this way:

df['column_name'].str.contains('test1')

This gives me a True/False Series indicating whether the string 'test1' is contained in each value of column 'column_name'.

However, I am not able to test for two strings at once, where I need to check whether both strings are present. Something like

  df['column_name'].str.contains('test1' and 'test2')

This does not seem to work. Any suggestions would be great.

Upvotes: 2

Views: 2702

Answers (5)

Use functools.reduce() if you need to apply a list of strings as a filter:

from functools import reduce
import pandas as pd

df = pd.DataFrame({
    'column_name': [1,'test1_sdv_test2_vsd',3,4,5, 'test2test1'],
    'column_name_2': [3,6,3,2,7,8]
})

items = ['test1', 'test2'] # list of strings you want to apply as filter


def filter_series_by_list(s, items): 
    return reduce(lambda a, b: a & b, (s.str.contains(item, na=False) for item in items))


print(filter_series_by_list(df['column_name'], items))


RESULT:
0    False
1     True
2    False
3    False
4    False
5     True
Name: column_name, dtype: bool
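The mask above is usually a means to an end, so here is a minimal sketch of using it to select rows. It reuses the same sample data; operator.and_ stands in for the lambda, and regex=False is my assumption that the search strings are meant literally rather than as regular expressions.

```python
from functools import reduce
import operator

import pandas as pd

df = pd.DataFrame({
    'column_name': [1, 'test1_sdv_test2_vsd', 3, 4, 5, 'test2test1'],
    'column_name_2': [3, 6, 3, 2, 7, 8],
})
items = ['test1', 'test2']

# One boolean mask per search string; na=False maps non-string cells to False.
# regex=False (an assumption) treats each item as a literal substring.
masks = [df['column_name'].str.contains(item, na=False, regex=False) for item in items]

# operator.and_ is the function form of `&`, so this matches the lambda version.
mask = reduce(operator.and_, masks)

# Keep only the rows where every search string was found.
matching_rows = df[mask]
print(matching_rows)
```

Only the rows with index 1 and 5 survive the filter.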

Upvotes: 0

B. M.

Reputation: 18628

You want to know whether test1 AND test2 both appear somewhere in the column, not necessarily in the same row. In that case:

df['column_name'].str.contains('test1').any() & df['column_name'].str.contains('test2').any()
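Note that this column-wide check answers a different question from a row-wise one. A small sketch with made-up data (the Series below is hypothetical, not from the question) shows the two disagreeing:

```python
import pandas as pd

# Hypothetical data: each word appears in the column, but never in the same row.
s = pd.Series(['only test1 here', 'only test2 here', 'neither'])

has_test1 = s.str.contains('test1')
has_test2 = s.str.contains('test2')

# Column-wide: is each word present in at least one row?
both_somewhere = bool(has_test1.any() and has_test2.any())

# Row-wise: does any single row contain both words?
both_in_one_row = bool((has_test1 & has_test2).any())

print(both_somewhere)    # True
print(both_in_one_row)   # False
```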

Upvotes: 0

Goodies

Reputation: 4681

Ignoring the missing quote from 'test2, the 'and' operator is a boolean logical operator. It does not concatenate strings, and it does not do what you think it does here.

>>> 'test1' and 'test2'
'test2'
>>> 'test1' or 'test2'
'test1'
>>> 10 and 20
20
>>> 10 and 0
0
>>> 0 or 20
20
>>> # => and so on...

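This occurs because the 'and' and 'or' operators return one of their operands rather than a strict True or False. 'and' returns the first falsy operand, or the last operand if all are truthy; 'or' returns the first truthy operand, or the last operand if none is. In essence, the return value is the last value to have been evaluated, whether it's a string or otherwise. Look at this behavior: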

>>> a = 'test1'
>>> b = 'test2'
>>> c = a and b
>>> c is a
False
>>> c is b
True

The second operand, b, is what ends up assigned to c. What you're looking for is a way to iterate over a list or set of strings and ensure that all of them are found. We use the all(iterable) function for this.

if all(df['column_name'].str.contains(_, na=False).any() for _ in ['test1', 'test2']):
    print("All strings are contained in it.")
else:
    print("Not all strings are contained in it.")

Assuming every string is found, the following is an example of what you'd receive:

>>> x = [bool(df['column_name'].str.contains(_, na=False).any()) for _ in ['test1', 'test2']]
>>> print(x)
[True, True]
>>> all(x)
True
>>> x[0] = bool(df['column_name'].str.contains('ThisIsNotInTheColumn', na=False).any())
>>> print(x)
[False, True]
>>> all(x)
False
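To tie this back to the question, here is a short sketch (the sample Series is made up for illustration) showing that the original call silently searches for 'test2' only, next to a row-wise version that checks both strings:

```python
import pandas as pd

# Hypothetical sample data.
s = pd.Series(['test1 only', 'test2 only', 'test1 and test2'])

# Python evaluates the argument first, so str.contains never sees 'test1'.
pattern = 'test1' and 'test2'
print(pattern)  # test2

wrong = s.str.contains('test1' and 'test2')
only_test2 = s.str.contains('test2')
print(wrong.equals(only_test2))  # True

# Row-wise check: build one mask per string and combine them with &.
both = s.str.contains('test1') & s.str.contains('test2')
print(both.tolist())  # [False, False, True]
```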

Upvotes: 2

user2255757

Reputation: 766

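all(df['column_name'].str.contains(word, na=False).any() for word in ['test1', 'test2'])

This tests whether an arbitrary number of words each appear somewhere in the column. Note that the in operator applied to a Series checks its index labels rather than its values, which is why str.contains is used here.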
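One caveat worth checking for yourself: on a pandas Series, the in operator looks at the index labels, not the values. A minimal sketch, using a made-up Series with string labels so the difference is visible:

```python
import pandas as pd

# Hypothetical Series with explicit index labels.
s = pd.Series(['test1_x', 'test2_y'], index=['a', 'b'])

print('a' in s)      # True: 'a' is an index label
print('test1' in s)  # False: the values are not searched

# Searching the values needs str.contains; .any() collapses each mask
# to "found somewhere in the column".
found_all = all(s.str.contains(word).any() for word in ['test1', 'test2'])
print(found_all)     # True
```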

Upvotes: 2

EdChum

Reputation: 393963

No, you have to create two conditions, combine them with &, and wrap the conditions in parentheses because of operator precedence:

(df['column_name'].str.contains('test1')) & (df['column_name'].str.contains('test2'))

If you wanted to test for either word, then the following would work:

df['column_name'].str.contains('test1|test2')
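For comparison, here is a sketch of both forms side by side on made-up data. The single-pattern AND using regex lookaheads is an alternative I am adding for illustration, not something from the answer; the Series is created with dtype=object so that matching goes through Python's re module, which supports lookaheads.

```python
import pandas as pd

# Hypothetical sample data; object dtype keeps matching on Python's re module.
s = pd.Series(['test1_sdv_test2_vsd', 'test1', 'test2test1', 'other'], dtype=object)

# AND: two separate conditions combined element-wise with &.
with_and = s.str.contains('test1') & s.str.contains('test2')

# AND as one pattern: each lookahead must succeed from the same position.
with_lookahead = s.str.contains(r'(?=.*test1)(?=.*test2)')

# OR: regex alternation inside a single pattern.
either = s.str.contains('test1|test2')

print(with_and.tolist())        # [True, False, True, False]
print(with_lookahead.tolist())  # [True, False, True, False]
print(either.tolist())          # [True, True, True, False]
```

The two-condition form is easier to read and does not depend on the regex engine, so the lookahead version is mostly useful when the pattern has to be passed around as a single string.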

Upvotes: 7
