Reputation: 10251
I am trying to remove all special characters from some text. Here is my regex:
pattern = re.compile(r'[\W_]+', re.UNICODE)
words = str(pattern.sub(' ', words))
It is super simple, but unfortunately it causes problems with apostrophes (single quotes). For example, for the word "doesn't", this code returns "doesn".
Is there any way of adapting this regex so that it doesn't remove apostrophes in instances like this?
edit: here is what I am after:
doesn't this mean it -technically- works?
should be:
doesn't this mean it technically works
Upvotes: 5
Views: 15606
Reputation: 4073
Like this?
>>> pattern = re.compile(r"[^\w']")
>>> pattern.sub(' ', "doesn't it rain today?")
"doesn't it rain today "
If underscores should also be filtered out:
>>> re.compile(r"(?:[^\w']|_)+").sub(" ", "doesn't this _technically_ means it works? naïve I am ...")
"doesn't this technically means it works naïve I am "
Upvotes: 12
Reputation: 10717
How about ([^\w']|_)+ ?
Note that this won't work well for things like:
doesn't this mean it 'technically' works?
The quotes around 'technically' would be kept, which might not be exactly what you're after.
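One possible way around that, as a rough sketch rather than a definitive fix: replace the unwanted runs with a space first, then strip apostrophes from the edges of each token. The function name clean_text is made up for this illustration, and the approach assumes an apostrophe only matters when it sits inside a word.

```python
import re

# Runs of characters that are neither word characters nor apostrophes,
# plus underscores (which \w would otherwise keep).
NON_WORD = re.compile(r"(?:[^\w']|_)+")

def clean_text(text):
    """Hypothetical helper: keep apostrophes inside words only."""
    spaced = NON_WORD.sub(" ", text)
    # Strip apostrophes at token edges, e.g. 'technically' -> technically.
    tokens = (tok.strip("'") for tok in spaced.split())
    # Drop tokens that consisted of apostrophes only.
    return " ".join(tok for tok in tokens if tok)

print(clean_text("doesn't this mean it 'technically' works?"))
```

Note that this would also drop the trailing apostrophe of a plural possessive such as "dogs'", so it may not suit every text.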
Upvotes: 0
Reputation: 24788
How about
re.sub(r"[^\w' ]", "", "doesn't this mean it -technically- works?")
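For what it's worth, here is a small sketch of that idea wrapped in a function. The name strip_specials is hypothetical, and removing underscores and collapsing leftover whitespace are additions assumed from the question, not part of the one-liner above.

```python
import re

def strip_specials(text):
    """Hypothetical helper: delete everything except word characters,
    apostrophes and whitespace, then normalise the spacing."""
    # Underscores count as word characters for \w, so remove them explicitly.
    kept = re.sub(r"[^\w'\s]|_", "", text)
    # Collapse any runs of whitespace left behind by the deletions.
    return " ".join(kept.split())

print(strip_specials("doesn't this mean it -technically- works?"))
```

Because the characters are deleted rather than replaced with a space, a hyphenated word such as "well-known" would come out as "wellknown", which may or may not be what you want.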
Upvotes: 0
Reputation: 4111
I was able to parse your sample into a list of words using this regex: [a-z]*'?[a-z]+
Then you can just join the elements of the list back with a space.
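Roughly, that could look like the sketch below. The regex is the one from this answer; the helper name extract_words and the lowercasing step are my own assumptions, added because the pattern only matches lowercase ASCII letters.

```python
import re

# Pattern from the answer: optional letters, an optional apostrophe, then letters.
WORD = re.compile(r"[a-z]*'?[a-z]+")

def extract_words(text):
    """Hypothetical helper: return the word-like pieces of text."""
    # Lowercase first (an assumption) so capitalised words are not cut short.
    return WORD.findall(text.lower())

sample = "doesn't this mean it -technically- works?"
cleaned = " ".join(extract_words(sample))
print(cleaned)
```

Keep in mind that this drops digits and non-ASCII letters, and a word wrapped in single quotes would keep its leading quote.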
Upvotes: 1