Brad Johnson

Reputation: 1934

How to convert an ASCII string with escape characters to its Unicode equivalent

# coding=ascii
bad_string = '\x9a'
expected = u'š'
good_string = bad_string.decode('unicode-escape').encode('utf-8')
if good_string != expected:
    raise AssertionError()

I would expect the above test to pass, but I'm getting the following error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)

What am I missing here?

(I can't simply change bad_string to be unicode. These are strings arriving from an outside source)

Upvotes: 1

Views: 1161

Answers (1)

Ry-

Reputation: 225015

'\x9a' doesn’t have any escape characters in it. The escape exists only in the source syntax of the string literal; the string it produces holds a single byte, 0x9a. The encoding might be Windows-1252, because that’s common and has š at 0x9a, but you really have to know what it is. To decode as Windows-1252:

good_string = bad_string.decode('cp1252')
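
As a quick sanity check, here is a minimal sketch that decodes the single byte 0x9a as Windows-1252 and compares the result with U+0161 (š). The variable names follow the question; the b'' literal and the \u escape are my own choices so the snippet runs unchanged on Python 2.7 and Python 3, and the choice of cp1252 is still an assumption about the outside source.

```python
# -*- coding: utf-8 -*-
# The literal '\x9a' is one byte, 0x9a; the backslash exists only in source code.
bad_string = b'\x9a'
assert len(bad_string) == 1

# Assumption: the outside source really sends Windows-1252 (cp1252).
good_string = bad_string.decode('cp1252')

# U+0161 is LATIN SMALL LETTER S WITH CARON, written as an escape so the
# file itself can stay pure ASCII.
expected = u'\u0161'
assert good_string == expected
```

Note that the result is a Unicode string. Only encode it (for example to UTF-8) at the point where bytes have to be written out again.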

If what you actually have is '\\x9a' (one backslash, three other characters), then you’ll need to convert it to the above form first. The right way to do this depends on how the escapes managed to get there in the first place. If it’s from a Python string literal, use string-escape first:

good_string = bad_string.decode('string-escape').decode('cp1252')
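
The string-escape codec exists only in Python 2. The sketch below is an alternative that should behave the same on Python 2.7 and Python 3 for plain ASCII input containing \xNN escapes. It goes through unicode_escape and latin-1 instead. The helper name unescape_and_decode is hypothetical, and cp1252 is again an assumption about the data. One caveat: unicode_escape also interprets \u and \N escapes, so this is not an exact replacement for string-escape on arbitrary input.

```python
def unescape_and_decode(escaped, encoding='cp1252'):
    # Hypothetical helper: escaped ASCII bytes in, Unicode text out.
    # unicode_escape maps each \xNN escape to the code point U+00NN ...
    as_code_points = escaped.decode('unicode_escape')
    # ... and latin-1 maps U+0000..U+00FF straight back to bytes 0x00..0xFF.
    raw_bytes = as_code_points.encode('latin-1')
    # Now decode the raw bytes with the encoding the source actually used.
    return raw_bytes.decode(encoding)


escaped = b'\\x9a'  # four bytes: backslash, 'x', '9', 'a'
assert len(escaped) == 4
assert unescape_and_decode(escaped) == u'\u0161'
```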

Upvotes: 1
