Reputation: 1811
I have a dataframe with five columns that looks like this:
index  col1   col2   col3   col4   col5
1      word1  None   word1  None   None
2      None   word1  word2  None   None
3      None   None   None   word2  word2
4      word1  word2  None   None   None
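For reproducibility, here is a minimal construction of that frame (a sketch, assuming the blank cells are Python None):

import pandas as pd

df = pd.DataFrame({
    'col1': ['word1', None, None, 'word1'],
    'col2': [None, 'word1', None, 'word2'],
    'col3': ['word1', 'word2', None, None],
    'col4': [None, None, 'word2', None],
    'col5': [None, None, 'word2', None],
}, index=pd.Index([1, 2, 3, 4], name='index'))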
I'm trying to find all rows that contain both strings in any combination of columns - in this case, rows 2 and 4. Normally I would use the str.contains method to filter by string:
df[df['col1'].str.contains('word1|word2', case=False)]
But this only gives me A) results for one column, and B) True when the column contains either word rather than both. I intuitively tried

df[df[['col1', 'col2', 'col3', 'col4', 'col5']].str.contains('word1' & 'word2', case=False)]

but .str.contains doesn't work on DataFrame objects.
Is there a way to do this without resorting to a for loop?
Upvotes: 2
Views: 814
Reputation: 51335
If there are only 2 words you are looking for, you could use np.isin and any to check whether each row of the underlying numpy array contains both elements, using a separate isin for each word:
df[np.isin(df.values, 'word1').any(1) & np.isin(df.values, 'word2').any(1)]
   index   col1   col2   col3  col4  col5
1      2   None  word1  word2  None  None
3      4  word1  word2   None  None  None
Or, following the same logic but borrowing a bit from @coldspeed's answer:
words = ['word1','word2']
df[np.logical_and.reduce([np.isin(df.values, w).any(1) for w in words])]
   index   col1   col2   col3  col4  col5
1      2   None  word1  word2  None  None
3      4  word1  word2   None  None  None
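One caveat (not part of the original answer): np.isin compares whole cell values, so this is an exact, case-sensitive match rather than the substring, case-insensitive behaviour of str.contains. A rough sketch of a case-insensitive variant, assuming every cell holds either a string or None:

words = ['word1', 'word2']
# lower-case every cell; fillna('') keeps missing cells from breaking .str.lower()
lowered = df.fillna('').apply(lambda c: c.str.lower()).values
df[np.logical_and.reduce([np.isin(lowered, w.lower()).any(1) for w in words])]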
Upvotes: 4
Reputation: 402483
Assuming you want only the rows with both word1 and word2 somewhere, you will need to stack, groupby on the index, and search inside an apply.
words = ['word1', 'word2']
# stack into one long Series, group by the original row label (level=0),
# and keep only the rows where every word matches somewhere in that group
df = df[df.stack().groupby(level=0).apply(
    lambda x: all(x.str.contains(w, case=False).any() for w in words))]

print(df)
        col1   col2   col3  col4  col5
index
2       None  word1  word2  None  None   # word1 => col2, word2 => col3
4      word1  word2   None  None  None   # word1 => col1, word2 => col2
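As a side note (my illustration, not from the original answer): df.stack() returns a Series whose outer index level is the original row label, which is why grouping on level=0 collects all of a row's non-null cells together:

v = df.stack()              # MultiIndex of (row label, column name)
v.groupby(level=0).size()   # number of non-null cells each row contributes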
Another alternative would be using np.logical_and.reduce:
v = df.stack()
# build one boolean mask per word, broadcast it back over each row with
# transform('any'), then AND the masks together
m = pd.Series(
    np.logical_and.reduce([
        v.str.contains(w, case=False).groupby(level=0).transform('any')
        for w in words]),
    index=v.index)
df = df[m.unstack().all(axis=1)]
print(df)
        col1   col2   col3  col4  col5
index
2       None  word1  word2  None  None
4      word1  word2   None  None  None
Upvotes: 2
Reputation: 323226
Using apply with any:
s1 = df.apply(lambda x: x.str.contains(r'word1')).any(axis=1)
s2 = df.apply(lambda x: x.str.contains(r'word2')).any(axis=1)
df[s1 & s2]
Out[452]:
        col1   col2   col3  col4  col5
index
2       None  word1  word2  None  None
4      word1  word2   None  None  None
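A small extension (mine, not part of the answer above): the same pattern generalizes to a list of words, and case=False can be added to match the question:

words = ['word1', 'word2']
masks = [df.apply(lambda x: x.str.contains(w, case=False)).any(axis=1) for w in words]
df[np.logical_and.reduce(masks)]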
Upvotes: 4