Reputation: 79

Why can't I display Chinese characters in Python even when I specify an encoding?

I was simply trying to read a Chinese text file and print its contents. The file contains simplified Chinese text that I copied from this web page: http://stock.hexun.com/2013-06-01/154742801.html

At first, I tried this:

userinput = raw_input('Enter the name of a file')
f=open(userinput,'r')
print f.read()
f.close()

It opens the file and prints, but the output is garbled. Then I tried the following, specifying an encoding:

#coding=UTF-8
userinput = raw_input('Enter the name of a file')
import codecs
f= codecs.open(userinput,"r","UTF-8")
str1=f.read()
print str1
f.close()

However, this gives me an error message: UnicodeEncodeError: 'cp950' codec can't encode character u'\u76d8' in position 50: illegal multibyte sequence

Why does this error occur, and how can I fix it? I have tried other encodings such as Big5 and cp950, but it still does not work.

Upvotes: 5

Answers (5)

Jean David

Reputation: 51

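with open('chinese.txt', 'rb') as inputFile:  # read-only binary mode is enough
    data = inputFile.read()  # raw bytes; the name avoids shadowing the built-in bytes
    print(data.decode('utf8'))  # assumes the file is UTF-8 encoded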

Upvotes: 0

Alfe

Reputation: 59426

Python (at least before Python 3.0) knows two kinds of string: ① a byte array and ② a character array.

Characters as in ② are Unicode; the type of this kind of string is also called unicode.

The bytes in ① (type named str in Python) can be a printable string or something else (binary data). If it is a printable string, it can also be an encoded version (e.g. UTF-8, latin-1 or similar) of a string of Unicode characters. In that case several bytes can represent a single character.

In your use case I'd suggest reading the file as a byte string:

with open('filename.txt') as inputFile:
    bytes = inputFile.read()

Then convert that byte array to a decent Unicode string by decoding it from the encoding used in the file (you will have to find that out!):

unicodeText = bytes.decode('utf-8')

Then print it:

print unicodeText

The last step depends on the capabilities of your output device (xterm, …). If it can display Unicode characters, everything is fine and the characters are displayed properly. But it might not support Unicode or, more likely, Python is simply not well informed about its capabilities; in that case you will get an error message. This also happens if you redirect your output into a file or pipe it into a second process.

To prevent this trouble, you can convert the Unicode string into a byte array again, using an encoding of your choice:

print unicodeText.encode('utf-8')

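This way you only print bytes, so Python never has to guess an output encoding and no encoding error is raised. The terminal, output file or second process (when piping) must still interpret those bytes as UTF-8 for the text to display correctly.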

If input and output encoding are the same, then of course, you won't have to decode and encode anything. But since you have some trouble, most likely the encodings differ, so you will have to do these two steps.
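The two steps described above (decode the file's bytes, then encode for output) can be sketched as follows. This is a minimal sketch written in Python 3 syntax, where the two string types are named bytes and str; the helper name transcode and the sample encodings are assumptions chosen for illustration, not part of the original answer.

```python
def transcode(data, source_encoding, target_encoding="utf-8"):
    """Decode bytes from one encoding and re-encode them in another."""
    text = data.decode(source_encoding)   # bytes -> Unicode text
    return text.encode(target_encoding)   # Unicode text -> bytes


# Hypothetical input: the character U+76D8 stored as GB18030 bytes.
raw = "\u76d8".encode("gb18030")

# Printing the bytes object shows an ASCII-safe repr, so this line
# works on any console regardless of its code page.
print(transcode(raw, "gb18030"))
```

The source encoding still has to match the file; passing the wrong one either raises UnicodeDecodeError or silently produces the wrong characters.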

Upvotes: 1

cdarke

Reputation: 44354

The problem is the terminal you are using to display the character. In IDLE on Windows 7 it works fine:

>>> val = u'\u76d8'
>>> print val
盘

but if I use cmd.exe then I get your error.

Use a terminal that supports Unicode output.
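If switching terminals is not an option, one possible fallback is to replace whatever the console codec cannot represent before printing. The following is a minimal sketch in Python 3 syntax; the helper name printable_in is hypothetical, and cp950 is used only because it is the codec named in the question's error message.

```python
def printable_in(text, encoding):
    """Return text with characters the codec cannot encode replaced by '?'."""
    # errors="replace" substitutes '?' instead of raising UnicodeEncodeError.
    return text.encode(encoding, errors="replace").decode(encoding)


# U+76D8 is the character from the error message; cp950 cannot encode it,
# so it is replaced, while plain ASCII text passes through unchanged.
print(printable_in("abc \u76d8", "cp950"))
```

This avoids the exception but loses the characters, so it only helps when a degraded display is acceptable.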

Upvotes: 7

tcpiper

Reputation: 2544

Just try:

f=open(userinput,'r')
print f.read().decode('gb18030').encode('u8')

Upvotes: -1

SheetJS

Reputation: 22905

Code page 936 is the only one that has character U+76D8 (which encodes to 0xC5CC). You need to use gbk or cp936.
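One way to check which codecs can represent a given character is simply to try encoding it. The sketch below uses Python 3 syntax; the function name encodings_supporting and the candidate list are assumptions made for illustration. Note that Unicode encodings such as UTF-8, and GB18030, can represent the character as well, so the check is mainly useful for comparing legacy code pages.

```python
CANDIDATES = ["big5", "cp950", "gbk", "cp936", "gb18030", "utf-8"]


def encodings_supporting(char, candidates=CANDIDATES):
    """Return the candidate codec names that can encode the character."""
    supported = []
    for name in candidates:
        try:
            char.encode(name)
        except UnicodeEncodeError:
            continue  # this codec has no mapping for the character
        supported.append(name)
    return supported


# U+76D8 is the character from the question's error message.
print(encodings_supporting("\u76d8"))
```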

Upvotes: 0
