Drj

Reputation: 1256

Non-matching word removal in Python

I have a text string and want to retain only specific words.

sample = "This is a test text. Test text should pass the test"
approved_list = ["test", "text"]

Expected output:

"test text Test text test"

I have read through a lot of regex-based answers, but unfortunately they do not address this specific issue.

Can the solution also be extended to a pandas series?

Upvotes: 1

Views: 31

Answers (1)

piRSquared

Reputation: 294218

You don't need pandas for this. Use the regex module re:

import re

re.findall('|'.join(approved_list), sample, re.IGNORECASE)

['test', 'text', 'Test', 'text', 'test']
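Note that the pattern above matches substrings too (so "testing" would yield "test"), and the result is a list rather than the single string the question asks for. A minimal sketch of one way to handle both, assuming whole-word matching is what is wanted; the names pattern, matches and result are only illustrative:

```python
import re

sample = "This is a test text. Test text should pass the test"
approved_list = ["test", "text"]

# Escape each approved word and wrap the alternation in \b word
# boundaries so only whole words match (assumption: "testing"
# should not count as a match for "test").
pattern = r"\b(?:" + "|".join(map(re.escape, approved_list)) + r")\b"

# Collect the matches case-insensitively, keeping their original casing.
matches = re.findall(pattern, sample, flags=re.IGNORECASE)

# Join the matches back into a single space-separated string.
result = " ".join(matches)
print(result)  # test text Test text test
```

re.escape only matters if an approved word contains regex metacharacters such as "." or "+", but it is cheap to include.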

If you had a pd.Series:

import pandas as pd
sample = pd.Series(["This is a test text. Test text should pass the test"] * 5)
approved_list = ["test", "text"]

Use the .str string accessor:

sample.str.findall('|'.join(approved_list), re.IGNORECASE)

0    [test, text, Test, text, test]
1    [test, text, Test, text, test]
2    [test, text, Test, text, test]
3    [test, text, Test, text, test]
4    [test, text, Test, text, test]
dtype: object
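Each element of that result is a list. If a Series of plain strings is wanted instead, as in the question's expected output, the lists can be joined with the same accessor. A hedged sketch, assuming whole-word matching is desired; pattern and joined are illustrative names:

```python
import re
import pandas as pd

sample = pd.Series(["This is a test text. Test text should pass the test"] * 5)
approved_list = ["test", "text"]

# Same whole-word pattern idea as for a plain string: escape each
# word and wrap the alternation in \b boundaries.
pattern = r"\b(?:" + "|".join(map(re.escape, approved_list)) + r")\b"

# findall gives a list of matches per row; .str.join glues each
# list back into one space-separated string.
joined = sample.str.findall(pattern, flags=re.IGNORECASE).str.join(" ")
print(joined)
```

Rows with no approved words end up as empty strings rather than missing values.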

Upvotes: 2
