Reputation: 1599
I am trying to remove certain whole words (case-insensitively) from a PySpark DataFrame column.
import re
s = "I like the book. i'v seen it. Iv've" # add a new phrase
exclude_words = ["I", "I\'v", "I\'ve"]
exclude_words_re = re.compile(r"\b(" + r"|".join(exclude_words) +r")\b|\s", re.I|re.M)
exclude_words_re.sub("" , s)
I added "Iv've" to the string, but got:
'like the book. seen it.'
"Iv've" should not be removed because it does not match any whole words in exclude_words.
Upvotes: 1
Views: 220
Reputation: 12098
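Two changes to make to your code:
1. Replace the trailing \b with (\s|$) to only include whole words. \b also matches between a letter and an apostrophe, so it does not reliably mark the end of a word such as I've.
2. Drop the standalone |\s alternative, which would otherwise strip every whitespace character.
import re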
s = "I like the book. i'v seen it. Iv've I've"
exclude_words = ["I", "I\'v", "I\'ve"]
exclude_words_re = re.compile(r"(^|\b)((" + r"|".join(exclude_words) +r"))(\s|$)", re.I|re.M)
exclude_words_re.sub("" , s)
"like the book. seen it. Iv've "
Upvotes: 1
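As a possible alternative to the pattern above, here is a minimal sketch that treats a word as whole only when it is delimited by whitespace or the string boundaries, using lookarounds instead of \b. The names whole_word_re, removed and cleaned are made up for illustration, and escaping the words with re.escape and sorting them longest-first are my own defensive additions, not part of the original code.

```python
import re

s = "I like the book. i'v seen it. Iv've I've"
exclude_words = ["I", "I'v", "I've"]

# Escape each word and list longer alternatives first (defensive; an assumption,
# not something the original code does).
alternatives = "|".join(
    re.escape(w) for w in sorted(exclude_words, key=len, reverse=True)
)

# (?<!\S) : no non-whitespace character directly before the match
# (?!\S)  : no non-whitespace character directly after the match
whole_word_re = re.compile(r"(?<!\S)(?:" + alternatives + r")(?!\S)", re.I)

removed = whole_word_re.sub("", s)

# The lookarounds do not consume whitespace, so collapse what is left behind.
cleaned = " ".join(removed.split())
print(cleaned)  # like the book. seen it. Iv've
```

The trade-off is that a word directly followed by punctuation (for example "I,") is not removed, because the comma counts as a non-whitespace character. Since the question is about a PySpark column: pyspark.sql.functions.regexp_replace takes Java regular expressions, which support the same lookarounds and an inline (?i) flag for case-insensitivity, so the pattern should carry over, though that part is untested here.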