why python2's re module can't identify the u'®' character

Question

I have a string that I want to clean up with re.sub in Python 2. The following statement worked:

>>> import re
>>> re.sub(u"[™®]", "", u"a™b®c")
'abc'

But the following statement failed on Windows 10 (Python 2.7.15 |Anaconda, Inc.| (default, May 1 2018, 18:37:09) [MSC v.1500 64 bit (AMD64)] on win32):

>>> re.sub(ur"[\u2122\u00ae]", "", u"a™b®c")
u'a?b?c'

I've tried the solution from "Python and regular expression with Unicode", but it didn't work either.

>>> myre = re.compile(ur'[\u2122\u00ae]', re.UNICODE)
>>> print myre.sub('', u"a™b®c")

So why does this happen, and how can I fix it?

abarnert · Accepted Answer

You have two problems here.

First, the whole point of raw string literals is that they don't treat backslash escapes as backslash escapes. So, ur"[\u2122\u00ae]" is literally the characters [, \, u, 2, 1, etc.

In Python 3, that's fine, because the re module understands \u escapes as meaning Unicode characters, so the pattern ends up being the character class with U+2122 and U+00AE in it, exactly as you want. But in Python 2, it doesn't, so the character class ends up being a mess of useless junk.
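The Python 3 half of that can be checked with a minimal sketch, assuming Python 3.3 or later (the release in which re gained \u support). The variable names are illustrative only:

```python
import re

# Assumes Python 3.3+: the raw literal keeps every backslash as a real character.
pattern = r"[\u2122\u00ae]"
pattern_length = len(pattern)  # counts [, \, u, 2, 1, 2, 2, \, u, 0, 0, a, e, ]

# The subject uses ordinary (non-raw) escapes, so it holds the real U+2122 and U+00AE.
subject = "a\u2122b\u00aec"

# Python 3's re module interprets \uXXXX inside the pattern itself.
cleaned = re.sub(pattern, "", subject)
print(cleaned)
```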

If you change it to use a non-raw string literal, that will solve that problem: u"[\u2122\u00ae]". Of course, that brings up all the other potential problems that make people want to use raw string literals with regular expressions in the first place, but fortunately, you don't have any of them here.
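A minimal sketch of the non-raw form, written so that it should behave the same on Python 2.7 and on Python 3.3+. The subject string uses escapes as well, so no console encoding is involved:

```python
import re

# Non-raw unicode literal: the Python parser itself turns \u2122 and \u00ae
# into the characters U+2122 and U+00AE before re ever sees the pattern.
pattern = u"[\u2122\u00ae]"
subject = u"a\u2122b\u00aec"

cleaned = re.sub(pattern, u"", subject)
print(repr(cleaned))
```

Here the pattern is only four characters long (the two brackets plus the two symbols), which is a quick way to confirm the escapes were processed.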

The second problem is that you're using Unicode characters in Unicode literals without an encoding declaration. Again, not a problem in Python 3, but it is in Python 2.

When you type "a™b®c", there's a good chance that you're actually giving Python not a \u2122 character, but a \u0099 character. Your console is probably in something like cp1252, so when you type or paste a ™, what it actually gives Python is U+0099, not U+2122. Of course your console also displays things incorrectly, so that U+0099 ends up looking like a ™. But Python doesn't have any idea what's going on. It just sees that U+0099 is not the same character as U+2122, and therefore there's no match. (Your first example works because your search string also has the incorrect \u0099, so it happens to match.)
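To illustrate the mismatch, here is a small sketch. It assumes the console really is cp1252, as guessed above, and it simulates the console input with an explicit byte instead of typed text:

```python
import re

# Byte 0x99 is what a cp1252 console sends for the trademark sign.
console_byte = b"\x99"

# Decoded as cp1252, it is the character the user meant (U+2122).
intended = console_byte.decode("cp1252")

# Taken at face value (byte value == code point, i.e. Latin-1), it is U+0099.
misread = console_byte.decode("latin-1")

pattern = u"[\u2122\u00ae]"
matched_intended = re.sub(pattern, u"", u"a" + intended + u"b")
matched_misread = re.sub(pattern, u"", u"a" + misread + u"b")
print(repr(matched_intended), repr(matched_misread))
```

Only the correctly decoded character is removed. The misread U+0099 stays in the string because it is simply a different code point.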

In source code, you could fix this, either by adding an encoding declaration to tell Python that you're using cp1252, or by telling your editor to use UTF-8 instead of cp1252 in the first place. But at the interactive interpreter, you get whatever encoding your console wants, and there's nowhere to put an encoding declaration.
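As a sketch of the source-file route, assuming the file is actually saved as UTF-8: the declaration has to match the bytes on disk, and it has to sit on the first or second line of the file.

```python
# -*- coding: utf-8 -*-
import re

# With the declaration above, Python 2 decodes these literals as UTF-8,
# so they hold U+2122 and U+00AE. Python 3 assumes UTF-8 by default.
cleaned = re.sub(u"[™®]", u"", u"a™b®c")
print(repr(cleaned))
```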

Really, there's no good solution to this.

Well, there is: upgrade to Python 3. The main reason it exists in the first place is to make Unicode headaches like this go away, and Python 2 is less than a year and a half from end of life, so do you really want to learn how to deal with Unicode headaches in Python 2 today?

You could also get a UTF-8 terminal (and one that Python recognizes as such). That's automatic on macOS or most recent Linux distros; on Windows, it's a lot harder, and probably not the way you want to go here.

So, the only alternative is to just never use Unicode characters in Unicode literals on the interactive interpreter. Again, you can use them in source code, but interactively, you have to either:

Use backslash escapes.
Use non-Unicode literals and carefully decode them everywhere.

I'm not sure whether "a™b®c".decode('cp1252') is really better than \u escapes, but it will work.
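A sketch of the decode route, again assuming a cp1252 console. The byte string below stands in for what such a console would deliver for the sample text, and the pattern uses backslash escapes so that neither side depends on how the prompt encodes typed characters:

```python
import re

# Bytes as a cp1252 console would deliver them for the sample text
# (0x99 is the trademark sign and 0xAE the registered sign in cp1252).
raw = b"a\x99b\xaec"

# Decode explicitly instead of trusting a unicode literal typed at the prompt.
text = raw.decode("cp1252")

cleaned = re.sub(u"[\u2122\u00ae]", u"", text)
print(repr(cleaned))
```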
