Reputation: 133
I have a text from which I want to remove all non-alphanumeric characters, while keeping non-ASCII (Unicode) letters AND the apostrophe, since it's part of words like wasn't, couldn't, French contractions, etc. I know I can do re.sub(ur'\W', '', text, flags=re.UNICODE)
to remove all non-alphanumeric characters, but I'm not sure how to do the same while preserving the apostrophe. Clearly re.sub(ur"[^A-Za-z0-9']", '', text)
doesn't work, because it would also strip the non-ASCII characters. Any ideas?
Upvotes: 1
Views: 1622
Reputation: 155418
In addition to re with re.UNICODE, if you're working with Py2 unicode or Py3 str, the string predicate methods such as isalnum() are Unicode-aware. So you could do:
# Py2 (convert text to unicode if it isn't already)
if not isinstance(text, unicode):
    text = text.decode("utf-8")  # Or latin-1; whatever encoding you're implicitly assuming
cleaned = u''.join(let for let in text if let == u"'" or let.isalnum())
# Py3 (str is already Unicode, so no decoding step is needed)
cleaned = ''.join(let for let in text if let == "'" or let.isalnum())
This is almost certainly slower than using re, but I figured I'd mention it for completeness.
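For reference, the Py3 one-liner above could be wrapped in a small helper along these lines. This is only a sketch: the function name and the keep parameter are made up for illustration and are not part of the original answer. Note that, like the \W approach in the question, it also removes whitespace, and that the typographic apostrophe (U+2019) is a different character from ' and is dropped unless you add it to keep.

```python
def keep_alnum_and_apostrophe(text, keep="'"):
    """Drop every character that is neither alphanumeric nor listed in `keep`.

    `keep` is a hypothetical convenience parameter; by default only the
    ASCII apostrophe survives alongside letters and digits.
    """
    return "".join(ch for ch in text if ch in keep or ch.isalnum())


sample = "Ce n'est pas ça... wasn't it?"
# Spaces and punctuation are removed; letters (including ç) and ' are kept.
print(keep_alnum_and_apostrophe(sample))
```

If the input may contain curly apostrophes, passing something like keep="'\u2019" would preserve both forms.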
Upvotes: 0
Reputation: 27283
You can use shorthand classes such as \w inside a bracketed character class (Py2 syntax shown):
re.sub(ur"[^\w']+", "", text, flags=re.UNICODE)
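On Python 3 the ur prefix is a syntax error and str patterns are Unicode-aware by default, so the same idea might look like the sketch below. One caveat worth knowing: \w also matches the underscore, so the pattern above keeps underscores. The function name and the keep_underscore flag here are hypothetical, added only to illustrate that difference.

```python
import re

# For str patterns in Python 3, \w already matches Unicode letters and digits,
# so flags=re.UNICODE is redundant.
NON_WORD_EXCEPT_APOSTROPHE = re.compile(r"[^\w']+")
# Same pattern, but additionally treats runs of underscores as removable.
NON_WORD_OR_UNDERSCORE = re.compile(r"[^\w']+|_+")


def strip_punctuation(text, keep_underscore=True):
    """Remove everything except word characters and the ASCII apostrophe.

    `keep_underscore` is an illustrative flag: set it to False to drop
    underscores as well, since \\w counts them as word characters.
    """
    pattern = NON_WORD_EXCEPT_APOSTROPHE if keep_underscore else NON_WORD_OR_UNDERSCORE
    return pattern.sub("", text)


print(strip_punctuation("l'été_2024, déjà!"))
print(strip_punctuation("l'été_2024, déjà!", keep_underscore=False))
```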
Upvotes: 1