Hanpan

Reputation: 10251

Python Regex - Remove special characters but preserve apostrophes

I am attempting to remove all special characters from some text. Here is my regex:

import re
pattern = re.compile(r'[\W_]+', re.UNICODE)
words = str(pattern.sub(' ', words))

Super simple, but unfortunately it causes problems with apostrophes (single quotes). For example, given the word "doesn't", this code returns "doesn".

Is there any way of adapting this regex so that it doesn't remove apostrophes in instances like this?

edit: here is what I am after:

doesn't this mean it -technically- works?

should be:

doesn't this mean it technically works

Upvotes: 5

Views: 15606

Answers (4)

tobixen

Reputation: 4073

Like this?

>>> pattern = re.compile(r"[^\w']")
>>> pattern.sub(' ', "doesn't it rain today?")
"doesn't it rain today "

If underscores should also be filtered out:

>>> re.compile(r"[^\w']|_").sub(" ", "doesn't this _technically_ means it works? naïve I am ...")
"doesn't this  technically  means it works  naïve I am    "

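The doubled spaces in that output appear because every removed character is replaced by its own space. Below is a minimal sketch of one way to tidy that up, assuming Python 3 (where \w is Unicode-aware for str patterns). The helper name clean_text is made up for illustration and is not part of the answer.

```python
import re

# Same character class as in the answer: anything that is not a word
# character or an apostrophe, plus the underscore.
PATTERN = re.compile(r"[^\w']|_")

def clean_text(text):
    """Hypothetical helper: replace unwanted characters, then collapse whitespace."""
    replaced = PATTERN.sub(" ", text)
    # str.split() with no arguments splits on runs of whitespace and drops
    # leading/trailing whitespace, so the join yields single spaces only.
    return " ".join(replaced.split())

print(clean_text("doesn't this _technically_ means it works? naïve I am ..."))
```

Splitting and re-joining is only one option; adding a + quantifier to the pattern and calling strip() on the result would be another.
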
Upvotes: 12

cha0site

Reputation: 10717

How about this pattern: ([^\w']|_)+

Note that this also keeps apostrophes that are used as quotation marks, so it won't work well for things like:

doesn't this mean it 'technically' works?

That might not be exactly what you're after.
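
To make the caveat concrete, here is a small sketch assuming Python 3. The first pattern is the one suggested in this answer (written with a non-capturing group). The second is an assumed variant, not from the answer, that also removes an apostrophe unless it has a word character on both sides.

```python
import re

text = "doesn't this mean it 'technically' works?"

# Pattern from the answer: runs of non-word, non-apostrophe characters
# (or underscores) are replaced by a single space.
simple = re.compile(r"(?:[^\w']|_)+")
simple_result = simple.sub(" ", text).strip()

# Assumed variant: an apostrophe is also removed when it is not preceded
# by a word character, or not followed by one.
strict = re.compile(r"(?:[^\w']|_|(?<!\w)'|'(?!\w))+")
strict_result = strict.sub(" ", text).strip()

print(simple_result)  # quotes around 'technically' are kept
print(strict_result)  # only the apostrophe inside doesn't is kept
```

The stricter variant has its own trade-off: a trailing possessive such as dogs' loses its apostrophe, so which behaviour is preferable depends on the text.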

Upvotes: 0

Joel Cornett

Reputation: 24788

How about

re.sub(r"[^\w' ]", "", "doesn't this mean it -technically- works?")
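
A short sketch of how that substitution behaves, assuming Python 3. The wrapper name strip_specials is hypothetical. Because the replacement is an empty string, removed characters are deleted rather than turned into spaces.

```python
import re

def strip_specials(text):
    """Hypothetical wrapper around the substitution shown above."""
    # Delete every character that is not a word character, an apostrophe,
    # or a literal space. Nothing is inserted in its place.
    return re.sub(r"[^\w' ]", "", text)

print(strip_specials("doesn't this mean it -technically- works?"))
print(strip_specials("a well-known fact"))
```

Two things to keep in mind: words joined by punctuation are fused (well-known becomes wellknown), and underscores survive because \w matches them.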

Upvotes: 0

Mike Z

Reputation: 4111

Using the regex [a-z]*'?[a-z]+ I was able to parse your sample into a list of words.

Then you can just join the elements of the list back with a space.
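
A minimal sketch of that find-then-join approach, assuming Python 3. The re.IGNORECASE flag and the function name extract_words are additions for illustration, not part of the answer.

```python
import re

# The regex from the answer; re.IGNORECASE is an added assumption so that
# capitalised words are matched as well.
WORD = re.compile(r"[a-z]*'?[a-z]+", re.IGNORECASE)

def extract_words(text):
    """Return every word-like match, in order of appearance."""
    return WORD.findall(text)

sample = "doesn't this mean it -technically- works?"
words = extract_words(sample)
print(words)
print(" ".join(words))
```

Since the character class is [a-z], digits and non-ASCII letters such as ï are not matched, and an apostrophe directly in front of a word (as in 'technically') is kept as part of the match.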

Upvotes: 1
