Reputation: 129
I'm handling an encoding problem. My input is a unicode string, such as:
>>> s
u'\xa6\xe8\xac\xc9'
The data is actually encoded in cp950. Decoding works when I start from a plain byte string (notice there's no "u"):
>>> print unicode('\xa6\xe8\xac\xc9', 'cp950')
西界
However, I don't know how to get rid of that "u". A direct conversion does not work:
>>> str(s)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
The result of using encode() is not what I wanted:
>>> s.encode('utf8')
'\xc2\xa6\xc3\xa8\xc2\xac\xc3\x89'
What I want is '\xa6\xe8\xac\xc9'.
Upvotes: 0
Views: 4045
Reputation: 308176
So let's get this straight: you have a sequence of bytes that were read in as Unicode codepoints, and you need them to be interpreted as cp950 instead?
>>> ''.join(chr(ord(c)) for c in s)
'\xa6\xe8\xac\xc9'
>>> print ''.join(chr(ord(c)) for c in s).decode('cp950')
西界
Upvotes: 0
Reputation: 179422
This is a bit of an abuse of the unicode type. Characters in a unicode string are expected to be Unicode codepoints (e.g. u'\u897f\u754c'), and thus are encoding-agnostic. They are not supposed to be bytes from a specific encoding (Python 3 makes this distinction very clear by separating Unicode strings, str, from byte strings, bytes).
Since you want to just interpret each codepoint as bytes, you can do
u'\xa6\xe8\xac\xc9'.encode('iso-8859-1')
since the first 256 codepoints of Unicode are defined to be equal to the codepoints of ISO-8859-1. However, please try to fix the issue that gave you this incorrect Unicode string in the first place.
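In Python 3 terms, the same identity-mapping property of ISO-8859-1 gives a two-step round trip. A minimal sketch, assuming the same mis-decoded input as in the question:

```python
# A string whose codepoints are really cp950 bytes.
s = '\xa6\xe8\xac\xc9'

# Codepoints 0-255 map 1:1 to ISO-8859-1 bytes, so encoding with
# latin-1 recovers the original raw bytes unchanged.
raw = s.encode('iso-8859-1')     # b'\xa6\xe8\xac\xc9'

# Now decode with the encoding the bytes were actually in.
print(raw.decode('cp950'))       # 西界
```

This avoids the per-character loop entirely, but the underlying fix is the same either way: recover the bytes, then decode them once with the correct codec.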
Upvotes: 2