Reputation: 1999
In Python 3.3 I have no trouble using ranges of Unicode codepoints within regular expressions:
>>> import re
>>> to_delete = '[\u0020-\u0090\ufb00-\uffff]'
>>> s = 'abcdABCD¯˘¸ðﺉ﹅ffl你我他𐀀𐌈𐆒'
>>> print(s)
abcdABCD¯˘¸ðﺉ﹅ffl你我他𐀀𐌈𐆒
>>> print(re.sub(to_delete, '', s))
¯˘¸ð你我他𐀀𐌈𐆒
It's clean and simple. But if I include codepoints of five hex digits, that is to say anything higher than \uffff, such as \u1047f, as part of a range beginning in four hex digits, I get an error:
>>> to_delete = '[\u0020-\u0090\ufb00-\u1047f]'
>>> print(re.sub(to_delete, '', s))
...
sre_constants.error: bad character range
There is no error if I start a new five-digit range, but I also do not get the expected behavior:
>>> to_delete = '[\u0020-\u0090\ufb00-\uffff\u10000-\u1047f]'
>>> print(re.sub(to_delete, '', s))
你我他𐀀𐌈𐆒
(The symbols 𐀀𐌈𐆒 are codepoints \u10000, \u10308, and \u10192, respectively, and should have been replaced in the last re.sub operation.)
Following the instructions of the accepted answer:
>>> to_delete = '[\u0020-\u0090\ufb00-\uffff\U00010000-\U0001047F]'
>>> print(re.sub(to_delete, '', s))
¯˘¸ð你我他
Perfect. Uglesome in the extreme, but perfect.
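If the eight-digit escapes are too hard on the eyes, one possible alternative is to build the character class from integer codepoints with chr(). This is only a sketch: char_class is a hypothetical helper of my own, not something from the answer, and the sample string is a shortened stand-in for the one above.

```python
import re

def char_class(*ranges):
    """Build a regex character class from (first, last) codepoint pairs.

    Hypothetical helper for illustration; assumes first <= last in each pair.
    """
    parts = []
    for first, last in ranges:
        # re.escape guards against endpoints that are special inside [...],
        # such as ']' or '\\'.
        parts.append(re.escape(chr(first)) + '-' + re.escape(chr(last)))
    return '[' + ''.join(parts) + ']'

to_delete = char_class((0x0020, 0x0090), (0xFB00, 0xFFFF), (0x10000, 0x1047F))

# Shortened sample: ASCII letters, U+00F0, U+4F60, U+10000.
sample = 'abc\u00f0\u4f60\U00010000'
cleaned = re.sub(to_delete, '', sample)
print(cleaned)
```

The ranges are then written as plain hex integers, so the number of digits no longer matters.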
Upvotes: 0
Views: 785
Reputation: 11
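\u takes exactly four hex digits, so it only supports 16-bit codepoints. That means '\u1047f' is read as the character \u1047 followed by a literal f, which is why your ranges misbehave. You need to use the 32-bit version, \U. Note that it requires exactly 8 hex digits, so you have to pad with leading 0s (e.g. \U00010D2B).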
Source: http://docs.python.org/3/howto/unicode.html#unicode-literals-in-python-source-code
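A quick way to see the difference is to check how many characters each escape produces. The snippet below is a minimal sketch; the variable names are mine and the sample string is a shortened stand-in for the one in the question.

```python
import re

# '\u' consumes exactly four hex digits, so this literal is two characters:
# U+1047 followed by the letter 'f'.
short_escape = '\u1047f'
print(len(short_escape))                      # 2
print([hex(ord(c)) for c in short_escape])    # ['0x1047', '0x66']

# '\U' consumes exactly eight hex digits and yields a single character.
long_escape = '\U0001047F'
print(len(long_escape))                       # 1
print(hex(ord(long_escape)))                  # 0x1047f

# With '\U' the five-digit range behaves as intended.
to_delete = '[\u0020-\u0090\ufb00-\uffff\U00010000-\U0001047F]'
sample = 'abc\u00f0\u4f60\U00010000\U00010308'
result = re.sub(to_delete, '', sample)
print(result)
```

Only the two characters outside all three ranges (U+00F0 and U+4F60) should survive the substitution.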
Upvotes: 3