Reputation: 129
I'm handling an encoding problem. My input is a unicode string, such as:
>>> s
u'\xa6\xe8\xac\xc9'
The data is actually encoded in cp950. Decoding works when I start from a plain byte string (notice there's no "u"):
>>> print unicode('\xa6\xe8\xac\xc9', 'cp950')
西界
However, I don't know how to get rid of that "u". A direct conversion does not work:
>>> str(s)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
The result of using encode() is not what I wanted:
>>> s.encode('utf8')
'\xc2\xa6\xc3\xa8\xc2\xac\xc3\x89'
What I want is '\xa6\xe8\xac\xc9'.
Upvotes: 0
Views: 4045
Reputation: 308176
So let's get this straight: you have a sequence of bytes that were read in as Unicode codepoints, and you need them to be interpreted as cp950 instead?
>>> ''.join(chr(ord(c)) for c in s)
'\xa6\xe8\xac\xc9'
>>> print ''.join(chr(ord(c)) for c in s).decode('cp950')
西界
Upvotes: 0
Reputation: 179422
This is a bit of an abuse of the unicode type. Characters in a unicode string are expected to be Unicode codepoints (e.g. u'\u897f\u754c'), and thus are encoding-agnostic. They are not supposed to be bytes from a specific encoding (Python 3 makes this distinction very clear by separating Unicode strings, str, from byte strings, bytes).
Since you want to just interpret each codepoint as bytes, you can do
u'\xa6\xe8\xac\xc9'.encode('iso-8859-1')
since the first 256 codepoints of Unicode are defined to be equal to the codepoints of ISO-8859-1. However, please try to fix the issue that gave you this incorrect Unicode string in the first place.
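In Python 3 terms, the same identity-mapping property of ISO-8859-1 gives a two-step round trip. A minimal sketch, assuming the same mis-decoded input as in the question:

```python
# A string whose codepoints are really cp950 bytes.
s = '\xa6\xe8\xac\xc9'

# Codepoints 0-255 map 1:1 to ISO-8859-1 bytes, so encoding with
# latin-1 recovers the original raw bytes unchanged.
raw = s.encode('iso-8859-1')     # b'\xa6\xe8\xac\xc9'

# Now decode with the encoding the bytes were actually in.
print(raw.decode('cp950'))       # 西界
```

This avoids the per-character loop entirely, but the underlying fix is the same either way: recover the bytes, then decode them once with the correct codec.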
Upvotes: 2