jwnz

Reputation: 75

Decoding Python Unicode strings that contain double backslashes

My strings look like this: \\xec\\x88\\x98. If I print them they look like this: \xec\x88\x98, and when I decode them they still look like this: \xec\x88\x98.

If I type the string in manually as \xec\x88\x98 and then decode it, I get the value I want.

If I call x.decode('unicode-escape'), it removes the double slashes, but when I decode the value it returns, I get ì.

How would I go about decoding the original \\xec\\x88\\x98 so that I get the correct output?

Upvotes: 2

Views: 584

Answers (1)

PM 2Ring

Reputation: 55489

In Python 2 you can use the 'string-escape' codec to convert '\\xec\\x88\\x98' to '\xec\x88\x98', which is the UTF-8 encoding of u'\uc218'.

Here's a short demo. Unfortunately, my terminal's font doesn't have that character, so I can't print it. Instead I'll print its name and its representation, and I'll also convert it to a Unicode-escape sequence.

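import unicodedata as ud

# The doubly escaped input: literal backslashes followed by xNN
src = '\\xec\\x88\\x98'
print repr(src)

# Interpret the escapes to get the raw UTF-8 bytes
s = src.decode('string-escape')
print repr(s)

# Decode the UTF-8 bytes to a Unicode string
u = s.decode('utf8')
print ud.name(u)
print repr(u), u.encode('unicode-escape')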

output

'\\xec\\x88\\x98'
'\xec\x88\x98'
HANGUL SYLLABLE SU
u'\uc218' \uc218
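
The 'string-escape' codec only exists in Python 2. For Python 3, here is a rough equivalent of the demo above. It is a sketch, not part of the original answer: the helper name unescape_utf8 is made up for illustration, and it assumes the input contains only Latin-1 characters and \xNN style escapes. It also shows why 'unicode-escape' on its own gives ì: each \xNN becomes a code point (U+00EC, U+0088, U+0098) rather than a byte, so the result still has to be mapped back to bytes before it is decoded as UTF-8.

```python
import unicodedata as ud


def unescape_utf8(src):
    """Hypothetical helper: turn text like '\\xec\\x88\\x98' into the text it encodes."""
    # Step 1: interpret the \xNN escapes. unicode_escape yields one code
    # point per escape (U+00EC, U+0088, U+0098), which starts with 'ì'.
    code_points = src.encode('latin-1').decode('unicode_escape')
    # Step 2: map those code points back to raw bytes (Latin-1 is a 1:1
    # mapping for U+0000..U+00FF), then decode the bytes as UTF-8.
    return code_points.encode('latin-1').decode('utf-8')


src = '\\xec\\x88\\x98'
u = unescape_utf8(src)
print(ud.name(u))   # HANGUL SYLLABLE SU
print(ascii(u))     # '\uc218'
```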

However, this is a "band-aid" solution. You should try to fix this problem upstream (in your Web spider) so that you receive the data as plain UTF-8 instead of that string-escaped UTF-8 that you're currently getting.
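
As an illustration of what "upstream" could mean, here is one way text like this can arise, shown in Python 3. This is only a guess at the cause, since the question doesn't show the code that produces the strings: the escaped form appears if the repr of the raw bytes gets stored as text instead of the bytes being decoded. The variable names are hypothetical.

```python
# Raw UTF-8 bytes, e.g. as read from a network response (assumed source).
raw = b'\xec\x88\x98'

# Storing the repr of the bytes (minus the b'...' wrapper) produces the
# escaped text from the question: a literal backslash, 'x', two hex digits.
escaped = repr(raw)[2:-1]
print(escaped)

# Decoding the bytes once, as UTF-8, gives the intended character directly,
# so no unescaping step is needed later.
text = raw.decode('utf-8')
print(ascii(text))
```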

Upvotes: 2
