Reputation: 167
I'm trying to match and remove all words in a list from a string using a compiled regex, but I'm struggling to avoid matching occurrences inside other words.
Current:
REMOVE_LIST = ["a", "an", "as", "at", ...]
remove = '|'.join(REMOVE_LIST)
regex = re.compile(r'('+remove+')', flags=re.IGNORECASE)
out = regex.sub("", text)
In: "The quick brown fox jumped over an ant"
Out: "quick brown fox jumped over t"
Expected: "quick brown fox jumped over"
I've tried changing the pattern passed to re.compile to the following, but to no avail:
regex = re.compile(r'\b('+remove+')\b', flags=re.IGNORECASE)
Any suggestions or am I missing something garishly obvious?
Upvotes: 9
Views: 26278
Reputation: 6086
Here is a suggestion that avoids regex altogether, which you may want to consider:
>>> sentence = 'word1 word2 word3 word1 word2 word4'
>>> remove_list = ['word1', 'word2']
>>> word_list = sentence.split()
>>> ' '.join([i for i in word_list if i not in remove_list])
'word3 word4'
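Note that the list comprehension above is case-sensitive, so it would not remove "The" if the list only holds "the". A minimal sketch of a case-insensitive variant follows; the function name `remove_words` and the sample list are hypothetical, chosen only for illustration, and punctuation attached to a word (such as "an,") is not handled.

```python
def remove_words(sentence, remove_list):
    # Lowercase the removal words once; a set gives fast membership tests.
    remove_set = {w.lower() for w in remove_list}
    # Keep only the whitespace-separated tokens whose lowercase form is not listed.
    return ' '.join(w for w in sentence.split() if w.lower() not in remove_set)

# Hypothetical sample list, not the asker's full REMOVE_LIST.
print(remove_words("The quick brown fox jumped over an ant",
                   ["the", "a", "an", "as", "at"]))
# prints: quick brown fox jumped over ant
```

Because whole tokens are compared, "ant" survives even though "a" and "an" are in the list.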
Upvotes: 19
Reputation: 500297
One problem is that only the first \b is inside a raw string. The second is interpreted as the backspace character (ASCII 8) rather than as a word boundary.
To fix, change
regex = re.compile(r'\b('+remove+')\b', flags=re.IGNORECASE)
to
regex = re.compile(r'\b('+remove+r')\b', flags=re.IGNORECASE)
                                 ^ THIS
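For completeness, here is a small self-contained sketch of the corrected approach. The sample list, the helper names `build_remover` and `strip_words`, the use of `re.escape`, and the whitespace cleanup are my own additions for illustration and are not part of the original question.

```python
import re

# In a normal string '\b' is a backspace; in a raw string it stays a regex word boundary.
assert '\b' == '\x08'
assert r'\b' == '\\b'

REMOVE_LIST = ["the", "a", "an", "as", "at"]  # hypothetical sample list

def build_remover(words):
    # re.escape protects against list entries containing regex metacharacters.
    alternatives = '|'.join(re.escape(w) for w in words)
    # Both string literals are raw, so both \b are word boundaries.
    return re.compile(r'\b(?:' + alternatives + r')\b', flags=re.IGNORECASE)

regex = build_remover(REMOVE_LIST)

def strip_words(text):
    out = regex.sub("", text)
    # Removing a word leaves its surrounding spaces behind; collapse them.
    return ' '.join(out.split())

print(strip_words("The quick brown fox jumped over an ant"))
# prints: quick brown fox jumped over ant
```

With word boundaries on both sides, "an" is removed as a standalone word while "ant" is left intact.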
Upvotes: 14