regex - specifying high Unicode codepoints in Python

Question

In Python 3.3 I have no trouble using ranges of Unicode codepoints within regular expressions:

>>> import re
>>> to_delete = '[\u0020-\u0090\ufb00-\uffff]'
>>> s = 'abcdABCD¯˘¸ðﺉ﹅ﬄ你我他𐀀𐌈𐆒'
>>> print(s)
abcdABCD¯˘¸ðﺉ﹅ﬄ你我他𐀀𐌈𐆒
>>> print(re.sub(to_delete, '', s))
¯˘¸ð你我他𐀀𐌈𐆒

It's clean and simple. But if I include a codepoint that needs five hex digits (anything higher than \uffff, such as \u1047f) as the end of a range that begins with a four-digit codepoint, I get an error:

>>> to_delete = '[\u0020-\u0090\ufb00-\u1047f]'
>>> print(re.sub(to_delete, '', s))
...
sre_constants.error: bad character range

There is no error if I start a new five-digit range, but I also do not get the expected behavior:

>>> to_delete = '[\u0020-\u0090\ufb00-\uffff\u10000-\u1047f]'
>>> print(re.sub(to_delete, '', s))
你我他𐀀𐌈𐆒

(The symbols 𐀀𐌈𐆒 are codepoints U+10000, U+10308, and U+10192, respectively, and should have been removed by the last re.sub operation.)

Following the instructions of the accepted answer:

>>> to_delete = '[\u0020-\u0090\ufb00-\uffff\U00010000-\U0001047F]'
>>> print(re.sub(to_delete, '', s))
¯˘¸ð你我他

Perfect. Uglesome in the extreme, but perfect.

JAB · Accepted Answer

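\u takes exactly four hex digits, so it only supports 16-bit codepoints. That means '\u1047f' is read as U+1047 followed by a literal 'f', which is why your ranges either fail or match the wrong characters. For anything above \uffff you need the 32-bit version, \U. Note that it requires exactly 8 hex digits, so you have to zero-pad the value (e.g. \U00010D2B).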
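
To make the difference concrete, here is a small sketch (assuming Python 3.3 or later, as in the question) that inspects how the two escapes are parsed and then reruns the question's substitution with the \U form. The variable names four_digit, eight_digit, and result are only illustrative.

```python
import re

# '\u' consumes exactly four hex digits; the trailing 'f' is a separate character.
four_digit = '\u1047f'
# '\U' consumes exactly eight hex digits and yields a single codepoint.
eight_digit = '\U0001047f'

print(len(four_digit), [hex(ord(c)) for c in four_digit])    # 2 ['0x1047', '0x66']
print(len(eight_digit), [hex(ord(c)) for c in eight_digit])  # 1 ['0x1047f']

# The question's sample string and the corrected character class.
s = 'abcdABCD¯˘¸ðﺉ﹅ﬄ你我他𐀀𐌈𐆒'
to_delete = '[\u0020-\u0090\ufb00-\uffff\U00010000-\U0001047F]'
result = re.sub(to_delete, '', s)
print(result)  # ¯˘¸ð你我他
```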

Source: http://docs.python.org/3/howto/unicode.html#unicode-literals-in-python-source-code
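
If the zero-padded literals feel unwieldy, one possible workaround is to build the character class from integer codepoints instead. This is only a sketch and not part of the original answer: char_class is a hypothetical helper, and it relies on the re module itself understanding \UXXXXXXXX escapes inside a pattern, which it does from Python 3.3 onward.

```python
import re

def char_class(*ranges):
    """Build a regex character class from (low, high) integer codepoint pairs.

    Hypothetical helper: each bound is emitted as an 8-digit \\U escape,
    which the re module parses itself (Python 3.3+).
    """
    parts = ['\\U{:08x}-\\U{:08x}'.format(lo, hi) for lo, hi in ranges]
    return '[' + ''.join(parts) + ']'

s = 'abcdABCD¯˘¸ðﺉ﹅ﬄ你我他𐀀𐌈𐆒'
to_delete = char_class((0x20, 0x90), (0xfb00, 0xffff), (0x10000, 0x1047f))
cleaned = re.sub(to_delete, '', s)
print(cleaned)  # ¯˘¸ð你我他
```

Writing the bounds as integers avoids counting digits by hand, at the cost of a slightly less readable pattern string.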
