Reputation: 17877
I am trying to remove certain characters from a string in Python. I have a list of characters or ranges of characters that I need removed, represented in hexadecimal like so:
- "0x00:0x20"
- "0x7F:0xA0"
- "0x1680"
- "0x180E"
- "0x2000:0x200A"
I am turning this list into a regular expression that looks like this:
re.sub(u'[\x00-\x20 \x7F-\xA0 \x1680 \x180E \x2000-\x200A]', ' ', my_str)
However, I am getting an error when I have \x2000-\x200A in there.
I have found that Python does not actually interpret u'\x2000' as a single character:
>>> '\x2000'
' 00'
It is treating it like '\x20' (a space) followed by whatever comes after it:
>>> '\x20blah'
' blah'
0x2000 is a valid Unicode code point: http://www.unicodemap.org/details/0x2000/index.html
I would like Python to treat it that way so I can use re to remove it from strings.
As an alternative, I would like to know of another way to remove these characters from strings.
I appreciate any help. Thanks!
Upvotes: 0
Views: 233
Reputation: 27321
From the docs (https://docs.python.org/2/howto/unicode.html):
Unicode literals can also use the same escape sequences as 8-bit strings, including \x, but \x only takes two hex digits so it can’t express an arbitrary code point. Octal escapes can go up to U+01ff, which is octal 777.
>>> s = u"a\xac\u1234\u20ac\U00008000"
... #      ^^^^ two-digit hex escape
... #          ^^^^^^ four-digit Unicode escape
... #                      ^^^^^^^^^^ eight-digit Unicode escape
>>> for c in s: print ord(c),
...
97 172 4660 8364 32768
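To connect the quoted passage to the question, here is a minimal sketch of the same rule in Python 3 syntax (an assumption on my part, since the quoted docs are for Python 2). The variable names are only illustrative.

```python
# Python 3: str literals are already Unicode, so the u prefix is optional.
two_digit = '\x2000'    # parsed as '\x20' (a space) followed by the characters '0' and '0'
four_digit = '\u2000'   # one character, code point U+2000
from_int = chr(0x2000)  # the same character, built from the integer code point

print(len(two_digit), len(four_digit))  # lengths differ: three characters vs. one
print(four_digit == from_int)
```

So for anything above 0xFF, either the \uNNNN escape or chr() on the integer value should do what \xNNNN was expected to do.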
Upvotes: 1
Reputation: 113914
In a Unicode string, you need to write such characters as \uNNNN, not \xNNNN, because \x consumes only two hex digits. The following works:
>>> import re
>>> my_str=u'\u2000abc'
>>> re.sub(u'[\x00-\x20 \x7F-\xA0 \u1680 \u180E \u2000-\u200A]', ' ', my_str)
' abc'
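If the list of hex ranges is kept as data, the pattern could also be built from it instead of being typed by hand. The following is only a sketch, assuming Python 3 and assuming the "START:END" strings from the question mean inclusive ranges; build_char_class and SPECS are hypothetical names, not part of any library. It also shows a str.translate variant for anyone who would rather avoid re.

```python
import re

# Specs copied from the question: "START:END" is an inclusive range,
# a bare value is a single code point.
SPECS = ["0x00:0x20", "0x7F:0xA0", "0x1680", "0x180E", "0x2000:0x200A"]

def build_char_class(specs):
    """Return a compiled regex matching any code point named in specs."""
    parts = []
    for spec in specs:
        bounds = [int(part, 16) for part in spec.split(":")]
        # chr() turns the integer into the real character, so no \x or \u
        # escapes are needed; re.escape guards characters special to re.
        chars = [re.escape(chr(cp)) for cp in bounds]
        parts.append("-".join(chars))
    return re.compile("[" + "".join(parts) + "]")

pattern = build_char_class(SPECS)
cleaned = pattern.sub(" ", "\u2000abc\u1680def")
print(repr(cleaned))

# Alternative without re: map each listed code point to a space.
table = {}
for spec in SPECS:
    bounds = [int(part, 16) for part in spec.split(":")]
    for cp in range(bounds[0], bounds[-1] + 1):
        table[cp] = " "

translated = "\u2000abc\u1680def".translate(table)
print(repr(translated))
```

Both variants replace each matched character with a single space, as in the original re.sub call. Unlike the hand-written pattern, the generated class contains no separator spaces, although U+0020 is still covered by the first range.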
Upvotes: 2