Reputation: 75
My strings look like this: \\xec\\x88\\x98, but if I print them they look like this: \xec\x88\x98, and when I decode them they still look like this: \xec\x88\x98. If I type the string in manually as \xec\x88\x98 and then decode it, I get the value I want: 수.
If I call x.decode('unicode-escape'), it removes the double backslashes, but when I decode the value it returns, I get ì.
How would I go about decoding the original \\xec\\x88\\x98 so that I get the correct output?
Upvotes: 2
Views: 584
Reputation: 55489
In Python 2 you can use the 'string-escape' codec to convert '\\xec\\x88\\x98' to '\xec\x88\x98', which is the UTF-8 encoding of u'\uc218'.
Here's a short demo. Unfortunately, my terminal's font doesn't have that character, so I can't print it. Instead I'll print its name and its representation, and I'll also convert it to a Unicode-escape sequence.
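import unicodedata as ud
# The escaped text: literal backslashes and hex digits, not raw bytes
src = '\\xec\\x88\\x98'
print repr(src)
# Turn the escape sequences into the actual bytes
s = src.decode('string-escape')
print repr(s)
# Decode those bytes as UTF-8 to get a Unicode string
u = s.decode('utf8')
print ud.name(u)
print repr(u), u.encode('unicode-escape')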
output
'\\xec\\x88\\x98'
'\xec\x88\x98'
HANGUL SYLLABLE SU
u'\uc218' \uc218
However, this is a "band-aid" solution. You should try to fix this problem upstream (in your Web spider) so that you receive the data as plain UTF-8 instead of the string-escaped UTF-8 you're currently getting.
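If you're on Python 3, the 'string-escape' codec isn't available, but the same idea can be sketched with a Latin-1 round trip. This also shows why 'unicode-escape' alone gives ì: it turns each \xNN into the code point U+00NN rather than into a byte, so the three UTF-8 bytes become three separate characters. The variable names below are just for illustration, and the sketch assumes the escaped text is pure ASCII.

```python
import unicodedata

# The escaped text from the question: literal backslashes followed by hex digits.
src = '\\xec\\x88\\x98'

# 'unicode_escape' maps each \xNN to the code point U+00NN, not to a byte,
# so this yields three characters, the first of which is U+00EC (ì).
mojibake = src.encode('ascii').decode('unicode_escape')

# Latin-1 maps code points 0-255 back to the same byte values,
# which recovers the raw UTF-8 bytes.
raw = mojibake.encode('latin-1')

# Decoding those bytes as UTF-8 gives the intended character.
text = raw.decode('utf-8')

print(repr(mojibake))
print(repr(raw))
print(unicodedata.name(text))
```

The last line should print the same character name as the Python 2 demo above. As with 'string-escape', this only repairs the data after the fact, so the advice about fixing it upstream still applies.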
Upvotes: 2