Check if a column contains words from another column in pandas dataframe

Question

Is there a fast way to check whether the comma-separated values in one column are present in another column (a sentence)? I want to keep the words that are present and remove the ones that are not, in a pandas DataFrame using Python.

The original data looks like this:

sentence       | word
---------------|------------------
Hello World    | world
Hi how are you | are, car
I am good      | good,bad,sad, am

The result should look like this:

sentence       | word
---------------|------------------
Hello World    | world
Hi how are you | are
I am good      | good, am

Performance matters, since this is a huge dataset.

Quang Hoang · Accepted Answer

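Since most of pandas' string operations are not vectorized, you can just use a list comprehension like this:

# Lowercase each sentence, split each word list on commas,
# and keep only the words found in the sentence.
df['word'] = [', '.join([w for w in ws if w in s])
                for s, ws in zip(df.sentence.str.lower(), df.word.str.split(r',\s*'))
             ]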

Output:

         sentence      word
0     Hello World     world
1  Hi how are you       are
2       I am good  good, am

Note: This is just an idea that needs improvement. The `in` test matches substrings, so it should be tightened to match whole words only (e.g. with a regex).
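To illustrate the caveat in the note, here is a minimal sketch using a made-up candidate list (not the question's data). It compares the substring test with a whole-word test based on a set of whitespace-separated tokens; the token-set check is just one possible way to tighten the match and assumes words are separated by plain whitespace with no punctuation attached.

```python
sentence = "i am good"
candidates = ["good", "go", "am", "bad"]  # hypothetical word list

# Substring test: "go" is reported even though it is only part of "good".
substring_hits = [w for w in candidates if w in sentence]

# Whole-word test: compare against the set of whitespace-separated tokens.
tokens = set(sentence.split())
whole_word_hits = [w for w in candidates if w in tokens]

print(substring_hits)   # ['good', 'go', 'am']
print(whole_word_hits)  # ['good', 'am']
```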

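Update: as noted above, here's such an improvement that only matches whole words:

import re
# Use a raw f-string so \b is a regex word boundary (not a backspace),
# and escape each word in case it contains regex metacharacters.
df['word'] = [', '.join([w for w in ws
                     if re.search(rf'\b{re.escape(w)}\b', s)])
                for s, ws in zip(df.sentence.str.lower(), df.word.str.split(r',\s*'))
             ]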
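For anyone who wants to try this end to end, below is a self-contained sketch that rebuilds the question's sample data and applies the whole-word match. The helper name `keep_present` is hypothetical and introduced only for readability; the logic is the same list comprehension, with `re.escape` added as a precaution.

```python
import re
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "sentence": ["Hello World", "Hi how are you", "I am good"],
    "word": ["world", "are, car", "good,bad,sad, am"],
})

def keep_present(sentence, words):
    """Join the candidates that occur as whole words in the sentence."""
    return ", ".join(
        w for w in words
        if re.search(rf"\b{re.escape(w)}\b", sentence)
    )

# Lowercase the sentences and split the word lists on commas plus optional spaces.
df["word"] = [
    keep_present(s, ws)
    for s, ws in zip(df["sentence"].str.lower(), df["word"].str.split(r",\s*"))
]

print(df["word"].tolist())  # ['world', 'are', 'good, am']
```

Note that only the sentence is lowercased here, as in the answer, so a candidate written with capitals would need to be lowercased as well before it could match.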
