Reputation: 218
How do I iterate through each value in one dataframe column and check if it contains words in another dataframe column?
a = pd.DataFrame({'text': ['the cat jumped over the hat', 'the pope pulled on the rope', 'i lost my dog in the fog']})
b = pd.DataFrame({'dirty_words': ['cat', 'dog', 'parakeet']})
a
text
0 the cat jumped over the hat
1 the pope pulled on the rope
2 i lost my dog in the fog
b
dirty_words
0 cat
1 dog
2 parakeet
I want to get a new dataframe that contains only these values:
result
0 the cat jumped over the hat
1 i lost my dog in the fog
Upvotes: 3
Views: 2592
Reputation: 323236
I think you can use isin
after str.split
a[pd.DataFrame(a.text.str.split().tolist()).isin(b.dirty_words.tolist()).any(1)]
Out[380]:
text
0 the cat jumped over the hat
2 i lost my dog in the fog
Upvotes: 3
Reputation: 402493
Use regex matching with str.contains
.
p = '|'.join(b['dirty_words'].dropna())
a[a['text'].str.contains(r'\b{}\b'.format(p))]
text
0 the cat jumped over the hat
2 i lost my dog in the fog
The word boundaries ensure you won't match "catch" just because it contains "cat" (thanks @DSM).
Upvotes: 3
Reputation: 164673
You can use a list comprehension with any
after splitting strings by whitespace. This method won't include "catheter" just because it includes "cat".
mask = [any(i in words for i in b['dirty_words'].values) \
for words in a['text'].str.split().values]
print(a[mask])
text
0 the cat jumped over the hat
2 i lost my dog in the fog
Upvotes: 3