user3448011

Reputation: 1599

PySpark DataFrame: remove whole words from a column, case-insensitively

I am trying to remove certain whole words, case-insensitively, from a PySpark DataFrame column.

import re
s = "I like the book. i'v seen it. Iv've" # add a new phrase
exclude_words = ["I", "I\'v", "I\'ve"]

exclude_words_re = re.compile(r"\b(" + r"|".join(exclude_words) +r")\b|\s", re.I|re.M)
exclude_words_re.sub("" , s)

I added "Iv've" to the string, but got:

'like the book. seen it.'

"Iv've" should not be removed, because it does not match any whole word in exclude_words.

Upvotes: 1

Views: 220

Answers (1)

Yaakov Bressler

Reputation: 12098

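Two changes to make to your code:

  1. Use the proper regex flags to ignore case.
  2. Anchor the match with \b (or the start of a line) before the word and with whitespace (or the end of a line) after it, so that only whole words match.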
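
import re

s = "I like the book. i'v seen it. Iv've I've"
exclude_words = ["I", "I'v", "I've"]

# Match a listed word only when it starts at a word boundary (or line start)
# and is followed by whitespace or the end of the line.
exclude_words_re = re.compile(
    r"(^|\b)(" + "|".join(map(re.escape, exclude_words)) + r")(\s|$)",
    re.I | re.M,
)
exclude_words_re.sub("", s)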

"like the book. seen it. Iv've "
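
A slightly more defensive variant, as a minimal sketch rather than a definitive implementation: it escapes each word with re.escape, tries longer words first, and replaces \b with lookarounds that treat the apostrophe as part of a word, so a listed word followed by punctuation is removed as well. The helper names (build_exclude_pattern, remove_words, remove_words_spark) are hypothetical. The Spark helper assumes a DataFrame with a string column and relies on pyspark.sql.functions.regexp_replace, which takes Java regex syntax, hence the inline (?i) flag instead of re.I.

```python
import re


def build_exclude_pattern(words):
    """Return regex source matching any of `words` as a whole word, ignoring case."""
    # Longest first, so "I've" is tried before "I'v" and "I".
    ordered = sorted(words, key=len, reverse=True)
    alternatives = "|".join(re.escape(w) for w in ordered)
    # The lookarounds reject a match that touches a word character or an
    # apostrophe, so "I" cannot match inside "I've" or "Iv've".
    # The trailing \s? also consumes one whitespace character after the word.
    return r"(?i)(?<![\w'])(?:" + alternatives + r")(?![\w'])\s?"


def remove_words(text, words):
    """Pure-Python version, useful for checking the pattern locally."""
    return re.sub(build_exclude_pattern(words), "", text)


def remove_words_spark(df, column, words):
    """Apply the same pattern to a string column of a Spark DataFrame (assumed input)."""
    # Imported lazily so the helpers above work without Spark installed.
    from pyspark.sql import functions as F

    pattern = build_exclude_pattern(words)
    return df.withColumn(column, F.regexp_replace(F.col(column), pattern, ""))
```

With the sample string, remove_words(s, exclude_words) returns "like the book. seen it. Iv've ", the same result as above. Unlike the pattern above, it also removes a listed word that is directly followed by punctuation, for example the "I" in "So do I.".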

Upvotes: 1
