Reputation: 8568
I have a pandas Series on which I am applying a string search this way:
df['column_name'].str.contains('test1')
This gives me a True/False Series depending on whether the string 'test1' is contained in column 'column_name'.
However, I am not able to test for two strings at once, where I need to check whether both strings are present. Something like:
df['column_name'].str.contains('test1' and 'test2')
This does not seem to work. Any suggestions would be great.
Upvotes: 2
Views: 2702
Reputation: 1
Just use the reduce() function if you need to apply a list of strings as a filter:
from functools import reduce
import pandas as pd
df = pd.DataFrame({
    'column_name': [1, 'test1_sdv_test2_vsd', 3, 4, 5, 'test2test1'],
    'column_name_2': [3, 6, 3, 2, 7, 8]
})
items = ['test1', 'test2']  # list of strings you want to apply as filter
def filter_series_by_list(s, items):
    return reduce(lambda a, b: a & b, (s.str.contains(item, na=False) for item in items))
print(filter_series_by_list(df['column_name'], items))
RESULT:
0 False
1 True
2 False
3 False
4 False
5 True
Name: column_name, dtype: bool
Upvotes: 0
Reputation: 18628
You want to know if test1 AND test2 are somewhere in the column. So:
df['col_name'].str.contains('test1').any() & df['col_name'].str.contains('test2').any()
This gives a single boolean for the whole column; the two strings may be in different rows.
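To make the difference between a column-level check and a row-level check concrete, here is a minimal sketch. The sample data and the variable names (in_column, in_same_row) are made up for illustration and are not from the question.

```python
import pandas as pd

# Hypothetical sample data: each string appears, but never in the same row.
df = pd.DataFrame({"column_name": ["only test1 here", "only test2 here", "nothing"]})
col = df["column_name"]

# Column-level: does each string appear somewhere, possibly in different rows?
in_column = bool(col.str.contains("test1").any() and col.str.contains("test2").any())

# Row-level: does a single row contain both strings?
in_same_row = col.str.contains("test1") & col.str.contains("test2")

print(in_column)             # True
print(in_same_row.tolist())  # [False, False, False]
```

If you need to filter rows, the row-level mask is the one to pass to df[...]; the column-level result only answers a yes/no question about the whole column.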
Upvotes: 0
Reputation: 4681
Ignoring the missing quote from 'test2, the 'and' operator is a boolean logical operator. It does not concatenate strings, and it does not perform the action that you are thinking it does.
>>> 'test1' and 'test2'
'test2'
>>> 'test1' or 'test2'
'test1'
>>> 10 and 20
20
>>> 10 and 0
0
>>> 0 or 20
20
>>> # => and so on...
This occurs because the and and or operators function as 'truth deciders' and have mildly strange behavior with strings. In essence, the return value is the last operand to have been evaluated, whether it's a string or otherwise. Look at this behavior:
>>> a = 'test1'
>>> b = 'test2'
>>> c = a and b
>>> c is a
False
>>> c is b
True
The latter value is what gets assigned to the variable. What you're looking for is a way to iterate over a list or set of strings and ensure that all of them result in True. We use the all(iterable) function for this.
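# .str.contains(...) yields a boolean Series; .any() reduces it to a single boolean
if all(df['column_name'].str.contains(s, na=False).any() for s in ['test1', 'test2']):
    print("All strings are contained in it.")
else:
    print("Not all strings are contained in it.")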
Assuming both strings are present, the following is an example of what you'd receive:
>>> x = [bool(df['column_name'].str.contains(s, na=False).any()) for s in ['test1', 'test2']]
>>> print(x)
[True, True] # => returns True for all()
>>> all(x)
True
>>> x[0] = bool(df['column_name'].str.contains('ThisIsNotInTheColumn', na=False).any())
>>> print(x)
[False, True]
>>> all(x)
False
Upvotes: 2
Reputation: 766
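df['column_name'].apply(lambda s: all(word in str(s) for word in ['test1', 'test2']))
This tests for an arbitrary number of words in each string. It is applied per row because word in df['column_name'] would check the index labels, not the values.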
Upvotes: 2
Reputation: 393963
No, you have to create two conditions and combine them with &, wrapping each condition in parentheses due to operator precedence:
(df['column_name'].str.contains('test1')) & (df['column_name'].str.contains('test2'))
If you wanted to test for either word then the following would work:
df['column_name'].str.contains('test1|test2')
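If a single pattern is preferred for the "both words" case as well, one possible approach is a regular expression with two lookaheads. This is a sketch rather than part of the answer above; the sample data and the names pattern and both are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data, including a missing value.
df = pd.DataFrame({
    "column_name": ["test1 only", "test2 then test1", "test1 then test2", "neither", None],
})

# Each lookahead (?=.*word) asserts that the word occurs somewhere in the string
# without consuming characters, so the order of the words does not matter.
pattern = r"^(?=.*test1)(?=.*test2)"

# na=False treats missing values as non-matches instead of propagating NaN.
both = df["column_name"].str.contains(pattern, na=False)
print(both.tolist())  # [False, True, True, False, False]
```

Combining two conditions with & is usually easier to read; the lookahead form is mainly useful when the pattern has to be passed around as a single string.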
Upvotes: 7