Reputation: 1934
# coding=ascii
bad_string = '\x9a'
expected = u'š'
good_string = bad_string.decode('unicode-escape').encode('utf-8')
if good_string != expected:
    raise AssertionError()
I would expect the above test to pass, but I'm getting the following error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
What am I missing here?
(I can't simply change bad_string to be unicode. These are strings arriving from an outside source.)
Upvotes: 1
Views: 1161
Reputation: 225015
'\x9a' doesn’t have any escape characters in it. The escape is part of the string literal, and the string it represents is just one byte: [0x9a]. The encoding might be Windows-1252, because that’s common and has š at 0x9a, but you really have to know what it is. To decode as Windows-1252:
good_string = bad_string.decode('cp1252')
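As a quick check of the single-byte case, here is a minimal sketch that assumes the data really is Windows-1252 (the codec Python calls `cp1252`). It uses `b''` and `u''` literals so it should behave the same on Python 2.7 and Python 3:

```python
# Sketch only: assumes the incoming byte is Windows-1252 encoded.
bad_string = b'\x9a'                       # one byte, 0x9a; no backslash in the data
assert len(bad_string) == 1

good_string = bad_string.decode('cp1252')  # bytes -> unicode text
assert good_string == u'\u0161'            # U+0161, LATIN SMALL LETTER S WITH CARON

# The same byte read as Latin-1 is a C1 control character, not a letter,
# which is why the source encoding has to be known rather than guessed.
assert bad_string.decode('latin-1') == u'\x9a'
```

The result is already a unicode string, so it can be compared with u'š' directly without re-encoding to UTF-8.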
If what you actually have is '\\x9a' (one backslash, three other characters), then you’ll need to convert it to the above form first. The right way to do this depends on how the escapes managed to get there in the first place. If it’s from a Python string literal, use string-escape first:
good_string = bad_string.decode('string-escape').decode('cp1252')
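The string-escape codec exists only in Python 2. As a rough sketch of the same two-step idea that should also run on Python 3, the escapes can be resolved with `unicode_escape` and then mapped back to raw bytes through Latin-1. This is a swapped-in technique rather than the call above, and it assumes the input contains only ASCII text plus `\xNN`-style escapes, and that the underlying encoding is Windows-1252:

```python
# Sketch only: portable stand-in for .decode('string-escape') on Python 2.
escaped = b'\\x9a'                  # four bytes: backslash, 'x', '9', 'a'
assert len(escaped) == 4

# Step 1: interpret the escape sequence, then map code points 0-255 back to bytes.
raw = escaped.decode('unicode_escape').encode('latin-1')
assert raw == b'\x9a'               # now the single-byte form

# Step 2: decode the raw byte under the assumed encoding.
good_string = raw.decode('cp1252')
assert good_string == u'\u0161'     # š
```

If the input could contain `\u` escapes above U+00FF, the Latin-1 step would raise an error, so this round trip is only suitable for byte-style escapes.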
Upvotes: 1