L. Chu

Reputation: 133

How to remove non-alphanumeric characters while keeping Unicode characters AND the apostrophe ( \' )?

I have a text from which I want to remove all non-alphanumeric characters, but keep Unicode (non-ASCII) characters AND the apostrophe, since it's part of words like wasn't, couldn't, French contractions, etc. I know I can do re.sub(ur'\W', '', text, flags=re.UNICODE) to remove all non-alphanumeric characters, but I'm not sure how to do the same while preserving the apostrophe. Clearly re.sub(ur'[^A-Za-z0-9\']', '', text) doesn't work, because it would also strip non-ASCII letters. Any ideas?

Upvotes: 1

Views: 1622

Answers (2)

ShadowRanger

Reputation: 155418

In addition to re with re.UNICODE: if you're working with a Py2 unicode or a Py3 str, the string predicate methods (isalnum() and friends) are Unicode-aware. So you could filter character by character:

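# Py2 (convert text to unicode if it isn't already)
if not isinstance(text, unicode):
    text = text.decode("utf-8")  # Or latin-1; whatever encoding you're implicitly assuming
# Keep apostrophes and anything unicode.isalnum() accepts; bind the result so it isn't discarded
cleaned = u''.join(let for let in text if let == u"'" or let.isalnum())

# Py3 (str is already Unicode, so no decoding step is needed)
cleaned = ''.join(let for let in text if let == "'" or let.isalnum())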

This is almost certainly slower than using re, but I figured I'd mention it for completeness.
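For Py3, the filter above can be wrapped in a small helper. This is only a minimal sketch: the function name strip_non_alnum and its keep parameter are made up for illustration and are not part of any library.

```python
def strip_non_alnum(text, keep="'"):
    """Drop every character that is neither alphanumeric nor listed in `keep`.

    `strip_non_alnum` and `keep` are hypothetical names used for illustration.
    """
    # str.isalnum() is Unicode-aware on Py3 str, so accented letters survive.
    return "".join(ch for ch in text if ch in keep or ch.isalnum())


print(strip_non_alnum("wasn't, l'été!"))  # wasn'tl'été
```

If the text may contain the typographic apostrophe (U+2019), which French text often does, passing keep="'\u2019" should preserve it as well.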

Upvotes: 0

L3viathan

Reputation: 27283

You can use character class shorthands inside character classes:

re.sub(ur"[^\w']+", "", text, flags=re.UNICODE)
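A rough Py3 equivalent, as a sketch rather than a drop-in: the ur"" prefix is a syntax error on Py3, and str patterns are already Unicode-aware there, so neither the prefix nor re.UNICODE is needed. One thing to watch is that \w also matches the underscore; the second pattern below is one possible way to drop it too. The constant and function names are made up for illustration.

```python
import re

# Py3 str patterns are Unicode-aware by default: no u prefix, no re.UNICODE.
NOT_WORD_OR_APOSTROPHE = re.compile(r"[^\w']+")
# \w includes "_", so add an alternative for it if underscores should go too.
NOT_ALNUM_OR_APOSTROPHE = re.compile(r"[^\w']+|_+")


def strip_keep_underscore(text):
    """Remove everything except word characters (incl. "_") and apostrophes."""
    return NOT_WORD_OR_APOSTROPHE.sub("", text)


def strip_drop_underscore(text):
    """Same as above, but underscores are removed as well."""
    return NOT_ALNUM_OR_APOSTROPHE.sub("", text)


print(strip_keep_underscore("déjà_vu, wasn't!"))  # déjà_vuwasn't
print(strip_drop_underscore("déjà_vu, wasn't!"))  # déjàvuwasn't
```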

Upvotes: 1
