chaimp

Reputation: 17877

How to get python to accept unicode character 0x2000 (and others)

I am trying to remove certain characters from a string in Python. I have a list of characters and character ranges that I need to remove, written in hexadecimal like so:

- "0x00:0x20"
- "0x7F:0xA0"
- "0x1680"
- "0x180E"
- "0x2000:0x200A"

I am turning this list into a regular expression that looks like this:

re.sub(u'[\x00-\x20 \x7F-\xA0 \x1680 \x180E \x2000-\x200A]', ' ', my_str)

However, I am getting an error when I have \x2000-\x200A in there.

I have found that Python does not actually interpret u'\x2000' as a character:

>>> '\x2000'
' 00'

It is treating it like '\x20' (a space) followed by whatever comes after it:

>>> '\x20blah'
' blah'

U+2000 (0x2000) is a valid Unicode character: http://www.unicodemap.org/details/0x2000/index.html

I would like Python to treat it that way so I can use re to remove it from strings.

As an alternative, I would like to know of another way to remove these characters from strings.

I appreciate any help. Thanks!

Upvotes: 0

Views: 233

Answers (2)

thebjorn

Reputation: 27321

From the docs (https://docs.python.org/2/howto/unicode.html):

Unicode literals can also use the same escape sequences as 8-bit strings, including \x, but \x only takes two hex digits so it can’t express an arbitrary code point. Octal escapes can go up to U+01ff, which is octal 777.

>>> s = u"a\xac\u1234\u20ac\U00008000"
... #      ^^^^ two-digit hex escape
... #          ^^^^^^ four-digit Unicode escape
... #                      ^^^^^^^^^^ eight-digit Unicode escape
>>> for c in s:  print ord(c),
...
97 172 4660 8364 32768
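
The example above is Python 2. As a minimal sketch, assuming Python 3 (where every str literal is Unicode), the same rule explains the behaviour in the question: \x consumes exactly two hex digits, while \u consumes exactly four. The variable names are only illustrative.

```python
# '\x2000' is parsed as '\x20' (a space) followed by the two literal characters '00'.
two_digit = '\x2000'

# '\u2000' is a single character, U+2000 (EN QUAD).
four_digit = '\u2000'

code_points_two = [hex(ord(c)) for c in two_digit]
code_points_four = [hex(ord(c)) for c in four_digit]

print(code_points_two)   # ['0x20', '0x30', '0x30']
print(code_points_four)  # ['0x2000']
```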

Upvotes: 1

John1024

Reputation: 113914

In a unicode string, \x takes exactly two hex digits, so any code point above 0xFF needs the four-digit \uNNNN escape (\u2000, not \x2000). The following works:

>>> import re
>>> my_str=u'\u2000abc'
>>> re.sub(u'[\x00-\x20 \x7F-\xA0 \u1680 \u180E \u2000-\u200A]', ' ', my_str)
' abc'
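
Since the question starts from a list of hex strings, the character class could also be generated rather than typed by hand. This is only a sketch, assuming Python 3 and the "0xLO:0xHI" format shown in the question; SPECS and build_char_class are hypothetical names. It writes every code point as an eight-digit \U escape, which the re module accepts in str patterns, so the \x two-digit limit never comes into play.

```python
import re

# Hypothetical input, copied from the format in the question.
SPECS = ["0x00:0x20", "0x7F:0xA0", "0x1680", "0x180E", "0x2000:0x200A"]


def build_char_class(specs):
    """Compile a character class matching every code point named in specs."""
    parts = []
    for spec in specs:
        # "0x2000:0x200A" -> [0x2000, 0x200A]; "0x1680" -> [0x1680]
        bounds = [int(part, 16) for part in spec.split(":")]
        lo, hi = bounds[0], bounds[-1]
        if lo == hi:
            parts.append("\\U{:08X}".format(lo))
        else:
            parts.append("\\U{:08X}-\\U{:08X}".format(lo, hi))
    return re.compile("[" + "".join(parts) + "]")


pattern = build_char_class(SPECS)
cleaned = pattern.sub(" ", "\u2000abc\u1680def\x7f")
print(repr(cleaned))  # ' abc def '
```

Each matched character is replaced with a single space, as in the re.sub call above; pass '' as the replacement to delete the characters instead.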

Upvotes: 2
