Hanfei Sun

Reputation: 47051

What's the difference between these methods of dealing with Unicode strings in Python?

I tried print a_str.decode("utf-8"), print uni_str, print uni_str.decode("utf-8"), and print uni_str.encode("utf-8"), but only the first one works.

 >>> print '\xe8\xb7\xb3'.decode("utf-8")
 跳
 >>> print u'\xe8\xb7\xb3\xe8'
 è·³è
 >>> print u'\xe8\xb7\xb3\xe8'.decode("utf-8")
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
   File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
     return codecs.utf_8_decode(input, errors, True)
 UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
 >>> print u'\xe8\xb7\xb3\xe8'.encode("utf-8")
 è·³è

I'm really confused about how to display a Unicode string correctly. If I have a string like this: a=u'\xe8\xb7\xb3\xe8', how can I print a?

Upvotes: 0

Views: 525

Answers (3)

Mohammad Alhashash

Reputation: 1589

The unicode string u'\xe8\xb7\xb3\xe8' is equivalent to u'\u00e8\u00b7\u00b3\u00e8'. What you want is u'\u8df3', which is encoded in UTF-8 as '\xe8\xb7\xb3'.

In Python 2, a unicode object is stored as UCS-2 or UCS-4 code units, depending on a build option. Either way, u'\xe8\xb7\xb3\xe8' is a string of 4 Unicode characters, not 4 bytes.

If you have a UTF-8 string (an 8-bit string) that was incorrectly interpreted as a Unicode string, you have to convert it back to an 8-bit string first and then decode it:

>>> ''.join([chr(ord(a)) for a in u'\xe8\xb7\xb3']).decode('utf8')
u'\u8df3'

Note that '\xe8\xb7\xb3\xe8' is not a valid UTF-8 string: the last byte, '\xe8', is the lead byte of a three-byte sequence, so it cannot terminate a UTF-8 string.
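
To illustrate the point about the trailing '\xe8', here is a minimal sketch that classifies a UTF-8 lead byte and shows that the four-byte string fails to decode while the first three bytes decode cleanly. The helper name expected_length is made up for this example; it only looks at the high bits of the lead byte and does not validate the whole sequence.

```python
def expected_length(lead_byte):
    """Return how many bytes a UTF-8 sequence starting with lead_byte should have.

    lead_byte is an integer in range(256). Hypothetical helper for illustration.
    """
    if lead_byte < 0x80:
        return 1          # 0xxxxxxx: plain ASCII
    if 0xC0 <= lead_byte < 0xE0:
        return 2          # 110xxxxx
    if 0xE0 <= lead_byte < 0xF0:
        return 3          # 1110xxxx
    if 0xF0 <= lead_byte < 0xF8:
        return 4          # 11110xxx
    raise ValueError("continuation byte or invalid lead byte")

data = b'\xe8\xb7\xb3\xe8'

# bytearray yields integers on both Python 2 and Python 3.
last_byte = bytearray(data)[-1]
needed = expected_length(last_byte)   # 0xE8 starts a three-byte sequence

try:
    data.decode('utf-8')
    decoded_ok = True
except UnicodeDecodeError:
    # The final 0xE8 announces two more bytes that never arrive.
    decoded_ok = False

# Dropping the stray trailing byte leaves a complete sequence.
first_char = data[:3].decode('utf-8')
```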

Upvotes: 0

xiaowl

Reputation: 5207

'\xe8\xb7\xb3' is the Chinese character 跳 encoded in UTF-8, so '\xe8\xb7\xb3'.decode('utf-8') works fine and returns its unicode value, u'\u8df3'. But u'\xe8\xb7\xb3' is a literal unicode string of three separate characters, which is not the same as the unicode value of 跳. A unicode string cannot be decoded; it is already unicode. Finally, a=u'\xe8\xb7\xb3\xe8' is not really a valid unicode string[1].

Where does the u'\xe8\xb7\xb3' come from? Another function?

[1] Check out the first comment.
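
As for why the traceback in the question says UnicodeEncodeError even though decode was called: in Python 2, calling .decode on a unicode object first encodes it implicitly with the default ASCII codec, and that hidden step is what fails. The sketch below makes that step explicit so it runs on both Python 2 and Python 3 (where str has no decode method at all); the variable names are just for illustration.

```python
text = u'\xe8\xb7\xb3\xe8'

try:
    # This is the step Python 2 performs behind the scenes
    # before it can call the UTF-8 decoder on a unicode object.
    text.encode('ascii')
    error_name = None
    error_reason = None
    error_start = None
except UnicodeEncodeError as exc:
    # Save the details here; Python 3 unbinds exc after the except block.
    error_name = type(exc).__name__
    error_reason = exc.reason
    error_start = exc.start
```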

Upvotes: 3

Ignacio Vazquez-Abrams

Reputation: 798686

If you have a string like that then it's broken. You'll need to encode it as Latin-1 to get it to a bytestring with the same byte values, and then decode as UTF-8.
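
A minimal sketch of that round trip, assuming the string really is UTF-8 bytes that were misread one byte per character. The function name repair_utf8_mojibake is made up for this example. Latin-1 maps code points 0 to 255 straight to the same byte values, which is why it recovers the original bytes.

```python
def repair_utf8_mojibake(text):
    """Re-interpret a unicode string whose characters are really UTF-8 bytes.

    Hypothetical helper: raises UnicodeEncodeError if text contains characters
    above U+00FF, and UnicodeDecodeError if the bytes are not valid UTF-8.
    """
    return text.encode('latin-1').decode('utf-8')

broken = u'\xe8\xb7\xb3'
fixed = repair_utf8_mojibake(broken)

# The string from the question has a stray trailing u'\xe8', so a strict
# decode fails; passing 'ignore' as the error handler drops the incomplete
# sequence at the end.
lenient = u'\xe8\xb7\xb3\xe8'.encode('latin-1').decode('utf-8', 'ignore')
```

Whether silently dropping the trailing byte is acceptable depends on where the data came from; if it was truncated upstream, fixing the source is the safer choice.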

Upvotes: 1
