Daming Lu

Reputation: 376

Chinese encoding in Python

When I output some Chinese characters in Python (Pandas), they are displayed as shown below:

\xe8\xbf\x99\xe7\xa7\x8d\xe6\x83\x85\xe5\x86\xb5\xe6\x98\xaf\xe6\xb2\xb9\xe6\xb3\xb5\xe6\x95\x85\xe9\x9a\x9c\xe7\x81\xaf\xef\xbc\x8c\xe6\xa3\x80\xe6\x9f\xa5\xe4\xb8\x80\xe4\xb8\x8b\xe6\xb2\xb9\xe6\xb3\xb5\xe6\x8f\x92\xe5\xa4\xb4\xe6\x98\xaf\xe5\x90\xa6\xe6\x8e\xa5\xe8\x99\x9a\xef\xbc\x8c\xe7\x84\xb6\xe5\x90\x8e\xe6\x9f\xa5\xe4\xb8\x80\xe4\xb8\x8b\xe6\xb2\xb9\xe6\xb3\xb5\xe5\x86\x85\xe7\xae\xa1\xe9\x81\x93\xe5\x8e\x8b\xe5\x8a\x9b\xe6\x98\xaf\xe5\x90\xa6\xe7\xac\xa6\xe5\x90\x88\xe6\xad\xa3\xe5\xb8\xb8\xe5\x80\xbc\xe3\x80\x82

What encoding is this? As far as I know, it is not Unicode. Thanks!

Upvotes: 0

Views: 10877

Answers (3)

MilkyWay90

Reputation: 2093

The output you are receiving is called a bytes object. In order to decode it, you need to do output.decode('utf-8').

For example:

output = b'\xe8\xbf\x99\xe7...'
unicode_output = output.decode('utf-8')
print(unicode_output)

This would then print the non-Latin characters (I cannot include them here because they count as spam).

Another way to do this in one line would be: print(b'\xe8\xbf\x99\xe7...'.decode('utf-8')).

If that doesn't work, it is probably because your output is not a bytes object but a string (str) that holds the same byte values as characters. In that case, there is another solution:

output = '\xe8\xbf\x99\xe7...'
# Latin-1 maps each character U+0000-U+00FF to the byte with the same value, so no exec is needed.
print(output.encode('latin-1').decode('utf-8'))

That should be able to fix it. Hope you got something useful out of this. Have a good day!
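When the value is a str rather than bytes, there are two common situations, and they need different fixes. Here is a minimal sketch, assuming Python 3. The variable names are only illustrative, and the sample data is just the first four characters from the question. Case 2 also assumes the text contains no single quotes.

```python
import ast

# Case 1: each original byte became one character (code points 0-255).
mojibake = '\xe8\xbf\x99\xe7\xa7\x8d\xe6\x83\x85\xe5\x86\xb5'
# Latin-1 turns every such character back into the byte with the same value.
fixed_1 = mojibake.encode('latin-1').decode('utf-8')

# Case 2: the str holds the escape sequences as literal text
# (a backslash, then "x", then two hex digits, and so on).
escaped = r'\xe8\xbf\x99\xe7\xa7\x8d\xe6\x83\x85\xe5\x86\xb5'
# ast.literal_eval only parses literals, so it is safer than exec here.
fixed_2 = ast.literal_eval("b'" + escaped + "'").decode('utf-8')

print(fixed_1)
print(fixed_2)
```

Checking len() tells the two cases apart: the string in case 1 has one character per byte, while the one in case 2 has four characters per byte.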

Upvotes: 1

rigsby

Reputation: 792

raw_bytes = b'\xe8\xbf\x99\xe7\xa7\x8d\xe6\x83\x85 . . .'

With raw_bytes being a <class 'bytes'> object that contains your UTF-8 encoded bytes, you can call decode on raw_bytes and get a <class 'str'> representation of your characters.

string_text = raw_bytes.decode("utf-8")
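If some values are already str and others are still bytes, a small helper can make the decode step safe to apply everywhere. This is only a sketch: to_text is a hypothetical name, not something from pandas or the standard library, and the sample bytes are the first four characters of the question's data.

```python
def to_text(value, encoding="utf-8"):
    """Return value as str, decoding it first if it is bytes."""
    if isinstance(value, (bytes, bytearray)):
        return value.decode(encoding)
    return value  # already text (or some other type): leave untouched


raw_bytes = b'\xe8\xbf\x99\xe7\xa7\x8d\xe6\x83\x85\xe5\x86\xb5'
string_text = to_text(raw_bytes)

print(type(raw_bytes), type(string_text))
print(string_text)
```

In pandas, such a helper could be applied to a column with Series.map(to_text). If every value in the column is bytes, Series.str.decode("utf-8") should do the same job.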

Upvotes: 0

Victor Sergienko

Reputation: 13495

This is a bytes object containing valid UTF-8 encoded Chinese text (as far as I can trust Google Translate).

If it's a string literal from your code, add # -*- coding: utf-8 -*- as the first line of your Python file (this matters on Python 2; Python 3 already treats source files as UTF-8 by default).

If it's external data, here's how to convert it to text (str type): bytes_text.decode("utf-8")
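For external data that comes from a file, the decode can also happen at read time by passing an encoding to open. A rough sketch, assuming Python 3: the file name sample.txt and the temporary directory are made up for the demonstration, and the data is the first four characters from the question.

```python
import os
import tempfile

data = b'\xe8\xbf\x99\xe7\xa7\x8d\xe6\x83\x85\xe5\x86\xb5'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")  # hypothetical file name

    # Write the raw UTF-8 bytes, as an external source might have done.
    with open(path, "wb") as f:
        f.write(data)

    # Binary mode returns bytes, which still need .decode("utf-8").
    with open(path, "rb") as f:
        as_bytes = f.read()

    # Text mode with an explicit encoding returns str directly.
    with open(path, encoding="utf-8") as f:
        as_text = f.read()

print(type(as_bytes), type(as_text))
print(as_text == as_bytes.decode("utf-8"))
```

Passing the encoding explicitly avoids depending on the platform default, which is not UTF-8 everywhere.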

Upvotes: 0
