Decoding Python Unicode strings that contain double backslashes

Question

My strings look like this \\xec\\x88\\x98, but if I print them they look like this \xec\x88\x98, and when I decode them they still look like this \xec\x88\x98.

If I type the string in manually as \xec\x88\x98 and then decode it, I get the value I want 수.

If I call x.decode('unicode-escape'), it removes the double slashes, but when I decode the value it returns, I get ì.

How would I go about decoding the original \\xec\\x88\\x98 so that I get the correct output?

PM 2Ring · Accepted Answer

In Python 2 you can use the 'string-escape' codec to convert '\\xec\\x88\\x98' to '\xec\x88\x98', which is the UTF-8 encoding of u'\uc218'.

Here's a short demo. Unfortunately, my terminal's font doesn't have that character, so I can't print it. Instead I'll print its name and its representation, and I'll also convert it to a Unicode-escape sequence.

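import unicodedata as ud

# Literal backslashes, as received from the spider
src = '\\xec\\x88\\x98'
print repr(src)

# Turn the escape sequences into the actual bytes
s = src.decode('string-escape')
print repr(s)

# Those bytes are UTF-8; decode them to a Unicode string
u = s.decode('utf8')
print ud.name(u)
print repr(u), u.encode('unicode-escape')

output

'\\xec\\x88\\x98'
'\xec\x88\x98'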
HANGUL SYLLABLE SU
u'\uc218' \uc218
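
The demo above is Python 2 only, since Python 3 has no 'string-escape' codec. A rough Python 3 sketch of the same idea follows. It assumes the input is an ASCII-only str holding literal \xNN escapes, and the helper name unescape_utf8 is made up for illustration. It also shows where the ì in the question comes from: 'unicode_escape' maps each \xNN to the code point U+00NN rather than to a byte, so the result has to go through Latin-1 to get the raw bytes back before the UTF-8 decode.

```python
import unicodedata


def unescape_utf8(text):
    """Hypothetical helper: turn literal \\xNN escapes into UTF-8 decoded text.

    Assumes `text` is ASCII-only; non-ASCII input raises UnicodeEncodeError.
    """
    # Each literal \xNN becomes the code point U+00NN (this is the 'ì' stage).
    code_points = text.encode('ascii').decode('unicode_escape')
    # Latin-1 maps U+0000..U+00FF one-to-one onto bytes 0x00..0xFF.
    raw = code_points.encode('latin-1')
    # The recovered bytes are the real UTF-8 data.
    return raw.decode('utf-8')


src = '\\xec\\x88\\x98'
result = unescape_utf8(src)
print(unicodedata.name(result))  # HANGUL SYLLABLE SU
print(ascii(result))             # '\uc218'
```

If the escaped data arrives as bytes rather than str, the .encode('ascii') step can be dropped.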

However, this is a "band-aid" solution. You should try to fix this problem upstream (in your Web spider) so that you receive the data as plain UTF-8 instead of that string-escaped UTF-8 that you're currently getting.
