Reputation: 218

Check if each value in a dataframe column contains words from another dataframe column

How do I iterate through each value in one dataframe column and check if it contains words in another dataframe column?

a = pd.DataFrame({'text': ['the cat jumped over the hat', 'the pope pulled on the rope', 'i lost my dog in the fog']})
b = pd.DataFrame({'dirty_words': ['cat', 'dog', 'parakeet']})

a    
    text
0   the cat jumped over the hat
1   the pope pulled on the rope
2   i lost my dog in the fog

b
    dirty_words
0   cat
1   dog
2   parakeet

I want to get a new dataframe that contains only these values:

result

0   the cat jumped over the hat
1   i lost my dog in the fog

Upvotes: 3

Answers (3)

BENY

Reputation: 323236

I think you can use isin after str.split

a[pd.DataFrame(a.text.str.split().tolist()).isin(b.dirty_words.tolist()).any(1)]
Out[380]: 
                          text
0  the cat jumped over the hat
2     i lost my dog in the fog

Upvotes: 3

cs95

Reputation: 402493

Use regex matching with str.contains.

p = '|'.join(b['dirty_words'].dropna())
a[a['text'].str.contains(r'\b{}\b'.format(p))]

                          text
0  the cat jumped over the hat
2     i lost my dog in the fog

The word boundaries ensure you won't match "catch" just because it contains "cat" (thanks @DSM).

Upvotes: 3

jpp

Reputation: 164673

You can use a list comprehension with any after splitting strings by whitespace. This method won't include "catheter" just because it includes "cat".

mask = [any(i in words for i in b['dirty_words'].values) \
        for words in a['text'].str.split().values]

print(a[mask])

                          text
0  the cat jumped over the hat
2     i lost my dog in the fog

Upvotes: 3

Check if each value in a dataframe column contains words from another dataframe column

Answers (3)

Related Questions