Reputation: 47051
I tried print a_str.decode("utf-8"), print uni_str, print uni_str.decode("utf-8"), and print uni_str.encode("utf-8"), but only the first one works:
>>> print '\xe8\xb7\xb3'.decode("utf-8")
跳
>>> print u'\xe8\xb7\xb3\xe8'
è·³è
>>> print u'\xe8\xb7\xb3\xe8'.decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
>>> print u'\xe8\xb7\xb3\xe8'.encode("utf-8")
è·³è
I'm really confused about how to display a Unicode string correctly. If I have a string like this:
a = u'\xe8\xb7\xb3\xe8'
how can I print a?
Upvotes: 0
Views: 525
Reputation: 1589
The unicode string u'\xe8\xb7\xb3\xe8' is equivalent to u'\u00e8\u00b7\u00b3\u00e8'. What you want is u'\u8df3', which is encoded in UTF-8 as '\xe8\xb7\xb3'.
In Python 2, a unicode object is stored as UCS-2 or UCS-4, depending on a build option. Either way, u'\xe8\xb7\xb3\xe8' is a string of four Unicode characters (U+00E8, U+00B7, U+00B3, U+00E8), not four bytes.
If you have a UTF-8 byte string (8-bit) that was incorrectly stored as a unicode string, you have to convert it back to an 8-bit string first and then decode it:
>>> ''.join([chr(ord(a)) for a in u'\xe8\xb7\xb3']).decode('utf8')
u'\u8df3'
Note that '\xe8\xb7\xb3\xe8' is not a valid UTF-8 string: the last byte, '\xe8', is the lead byte of a three-byte sequence and cannot terminate a UTF-8 string.
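To make the point about the trailing byte concrete, here is a minimal sketch (the variable names are only illustrative, and it is written to run on Python 2.7 as well as Python 3). It decodes the complete byte sequence, then shows that the same bytes with a dangling lead byte are rejected by a strict decode:

```python
complete = b'\xe8\xb7\xb3'        # one full UTF-8 sequence
truncated = b'\xe8\xb7\xb3\xe8'   # same bytes plus a lead byte with no continuation bytes

decoded = complete.decode('utf-8')   # a single code point, U+8DF3

# A strict decode refuses the incomplete trailing sequence.
try:
    truncated.decode('utf-8')
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# errors='ignore' silently drops the dangling byte instead of raising.
lenient = truncated.decode('utf-8', 'ignore')
```

Whether dropping the stray byte is acceptable depends on where the data came from; a trailing lead byte often means the input was cut off mid-character.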
Upvotes: 0
Reputation: 5207
'\xe8\xb7\xb3' is a Chinese character encoded in UTF-8, so '\xe8\xb7\xb3'.decode('utf-8') works fine and returns the unicode value of 跳, u'\u8df3'. But u'\xe8\xb7\xb3' is a literal unicode string of three characters, which is not the same as the unicode value of 跳. And a unicode string cannot be decoded: it is already unicode.
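A short sketch of the difference, assuming nothing beyond the built-in string types (names are illustrative; it runs on Python 2.7 and Python 3). It also reproduces the error from the question: in Python 2, calling .decode() on a unicode object first encodes it implicitly with the ASCII codec, and that hidden step is what raises UnicodeEncodeError. The sketch performs that encode explicitly:

```python
mojibake = u'\xe8\xb7\xb3'   # three code points: U+00E8, U+00B7, U+00B3
real = u'\u8df3'             # one code point: the character itself

lengths = (len(mojibake), len(real))

# Encoding the real character gives exactly the bytes from the question.
utf8_bytes = real.encode('utf-8')

# The implicit step behind unicode.decode() in Python 2, made explicit:
# none of these code points fit in ASCII, so the encode fails.
try:
    mojibake.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```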
Finally [1], a = u'\xe8\xb7\xb3\xe8' is not really a meaningful unicode string: it holds UTF-8 byte values as code points, plus a stray trailing \xe8.
Where does u'\xe8\xb7\xb3' come from? Another function?
[1] Check out the first comment.
Upvotes: 3
Reputation: 798686
If you have a string like that then it's broken. You'll need to encode it as Latin-1 to get it to a bytestring with the same byte values, and then decode as UTF-8.
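The round trip described above could look roughly like this. The helper name repair_mojibake is hypothetical, not part of any library, and the sketch assumes every code point in the broken string is below U+0100 so that the Latin-1 encode cannot fail. It runs on Python 2.7 and Python 3:

```python
def repair_mojibake(text, errors='strict'):
    """Hypothetical helper: turn Latin-1 code points back into bytes,
    then decode those bytes as UTF-8."""
    raw = text.encode('latin-1')        # code point N becomes byte N
    return raw.decode('utf-8', errors)  # reinterpret the bytes as UTF-8

# The three-character string from the answers decodes cleanly.
fixed = repair_mojibake(u'\xe8\xb7\xb3')

# The four-character string from the question ends in a dangling lead
# byte, so a strict decode would raise; 'ignore' drops that byte.
fixed_lenient = repair_mojibake(u'\xe8\xb7\xb3\xe8', errors='ignore')
```

Latin-1 is used for the first step because it maps code points 0 to 255 onto the same byte values, which is exactly what the chr(ord(...)) loop in the other answer does by hand.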
Upvotes: 1