Reputation: 29
Is there a fast way to check whether the comma-separated values in one column are present in another column (a sentence)? I want to keep the words that are present and remove the ones that are not, in a pandas DataFrame using Python.
The original data looks like this:
sentence        | word
-----------------------------------
Hello World     | world
Hi how are you  | are, car
I am good       | good,bad,sad, am
The result should look like this:
sentence        | word
-----------------------------------
Hello World     | world
Hi how are you  | are
I am good       | good, am
Performance matters because this is a huge dataset.
Upvotes: 1
Views: 1372
Reputation: 150755
Since most pandas string operations are not truly vectorized, a plain list comprehension is a reasonable choice:
df['word'] = [
    # keep only the words that occur in the lowercased sentence (substring match)
    ', '.join(w for w in ws if w in s)
    for s, ws in zip(df.sentence.str.lower(), df.word.str.split(r',\s*'))
]
Output:
sentence word
0 Hello World world
1 Hi how are you are
2 I am good good, am
Note: this is only a starting point. It matches substrings, so it needs refinement such as matching whole words only (e.g. with a regex).
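To make the substring caveat concrete, here is a small sketch in plain Python. The sentence and candidate words are made up for illustration and are not from the question's data.

```python
import re

# Hypothetical example data, chosen so that substring matching misfires.
sentence = "the care package arrived"
candidates = ["are", "car", "package"]

# Substring test: "are" and "car" both match inside "care".
substring_hits = [w for w in candidates if w in sentence]

# Whole-word test: \b requires a word boundary on both sides of the word.
whole_word_hits = [
    w for w in candidates
    if re.search(rf"\b{re.escape(w)}\b", sentence)
]

print(substring_hits)
print(whole_word_hits)
```

The substring test keeps all three candidates, while the whole-word test keeps only "package".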
Update: as noted above, here is an improved version that matches whole words only:
import re

df['word'] = [
    # re.escape guards against regex metacharacters inside a word
    ', '.join(w for w in ws if re.search(rf'\b{re.escape(w)}\b', s))
    for s, ws in zip(df.sentence.str.lower(), df.word.str.split(r',\s*'))
]
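Since the question stresses performance, one possible variation (not the method above, but a swapped-in technique) is to tokenize each sentence once and test words by set membership instead of running one regex search per word. The sketch below is plain Python; the helper name `keep_present_words` is hypothetical, and it assumes each word is a single token made of `\w` characters.

```python
import re

def keep_present_words(sentence, words):
    """Return the comma-separated words that occur as whole tokens in sentence."""
    # Tokenize the sentence once; set lookups are O(1) per word.
    tokens = set(re.findall(r"\w+", sentence.lower()))
    parts = (p.strip() for p in words.split(","))
    return ", ".join(w for w in parts if w.lower() in tokens)

# Sample rows copied from the question.
sentences = ["Hello World", "Hi how are you", "I am good"]
words = ["world", "are, car", "good,bad,sad, am"]

result = [keep_present_words(s, w) for s, w in zip(sentences, words)]
print(result)
```

With a DataFrame, the same helper could be applied as `df['word'] = [keep_present_words(s, w) for s, w in zip(df['sentence'], df['word'])]`. Whether it is actually faster on a given dataset would need to be measured.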
Upvotes: 1