How to remove non-alphanumeric characters while keeping Unicode characters AND the apostrophe (')?

Question

I have a text from which I want to remove all non-alphanumeric characters, but keep Unicode characters AND the apostrophe, since it's part of words like wasn't, couldn't, French contractions, etc. I know I can use re.sub(ur'\W', '', text, flags=re.UNICODE) to remove all non-alphanumeric characters, but I'm not sure how to do the same while preserving the apostrophe. Clearly re.sub(ur'[^A-Za-z0-9\']', '', text) doesn't work, because it would also get rid of the non-ASCII Unicode characters. Any ideas?

L3viathan · Accepted Answer

You can use character class shorthands inside character classes:

re.sub(ur"[^\w']+", "", text, flags=re.UNICODE)
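The ur"" prefix above is Python 2 syntax. As a minimal sketch, assuming Python 3 (where str patterns are Unicode-aware by default, so neither the u prefix nor re.UNICODE is needed), the same idea could look like the code below. The function names are hypothetical, chosen only for illustration. Note that \w also matches the underscore, and that this pattern removes whitespace too; the second variant keeps whitespace by adding \s to the negated class.

```python
import re

def strip_non_word(text):
    # Remove every run of characters that is neither a word character
    # (\w: Unicode letters, digits, underscore) nor an apostrophe.
    return re.sub(r"[^\w']+", "", text)

def strip_non_word_keep_spaces(text):
    # Same as above, but whitespace (\s) is preserved as well.
    return re.sub(r"[^\w\s']+", "", text)

if __name__ == "__main__":
    sample = "wasn't, l'\u00e9t\u00e9!"
    print(strip_non_word(sample))
    print(strip_non_word_keep_spaces(sample))
```

Accented letters are kept here only when they are single precomposed code points; a combining accent following a base letter is not matched by \w and would be stripped, so normalizing with unicodedata.normalize("NFC", text) first may be worth considering.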
