Reputation: 79
I was simply trying to read a Chinese text file and print its contents. The text, which I copied from the web, is in Simplified Chinese: http://stock.hexun.com/2013-06-01/154742801.html
At first, I tried this:
userinput = raw_input('Enter the name of a file')
f=open(userinput,'r')
print f.read()
f.close()
It opens the file and prints, but the output is garbled. Then I tried the following, with an explicit encoding:
#coding=UTF-8
userinput = raw_input('Enter the name of a file')
import codecs
f= codecs.open(userinput,"r","UTF-8")
str1=f.read()
print str1
f.close()
However, it gives me an error message: UnicodeEncodeError: 'cp950' codec can't encode character u'\u76d8' in position 50: illegal multibyte sequence
Why does this error happen, and how can I solve it? I have tried other encodings such as Big5 and cp950, but it still does not work.
Upvotes: 5
Views: 10364
Reputation: 51
with open('chinese.txt','r+b') as inputFile:
    bytes = inputFile.read()
print(bytes.decode('utf8'))
Upvotes: 0
Reputation: 59426
Python (at least before Python 3.0) knows two kinds of string: ① a byte array and ② a character array.
Characters as in ② are Unicode; the type of this kind of string is also called unicode.
The bytes in ① (type named str in Python) can be a printable string or something else (binary data). If it's a printable string, it can also be an encoded version (e.g. UTF-8, latin-1 or similar) of a string of Unicode characters. Then several bytes can represent a single character.
In your use case I'd propose reading the file as a string of bytes:
with open('filename.txt') as inputFile:
    bytes = inputFile.read()
Then convert that byte array to a decent Unicode string by decoding it from the encoding used in the file (you will have to find that out!):
unicodeText = bytes.decode('utf-8')
Then print it:
print unicodeText
The last step depends on the capabilities of your output device (xterm, …). If it is capable of displaying Unicode characters and Python knows this, everything is fine and the characters are displayed properly. But the device might be incapable of Unicode or, more likely, Python is simply not well informed about its capabilities; in that case you will get an error message. The same happens if you redirect your output into a file or pipe it into a second process.
To prevent this trouble, you can convert the Unicode string into a byte array again, using an encoding of your choice:
print unicodeText.encode('utf-8')
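This way you only print bytes, which every terminal, output file and second process (when piping) can accept without an encoding error. The text is displayed correctly only if the receiver interprets those bytes in the same encoding.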
If input and output encoding are the same, then of course, you won't have to decode and encode anything. But since you have some trouble, most likely the encodings differ, so you will have to do these two steps.
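The two steps above can be put together in one place. The following is only a minimal sketch, assuming the file really is UTF-8; the helper names `read_text` and `to_output_bytes` are hypothetical, and a temporary file stands in for the real input. It is written so that it should also run on Python 3.

```python
import io
import os
import tempfile


def read_text(path, encoding='utf-8'):
    """Read the raw bytes of a file and decode them to a Unicode string."""
    with io.open(path, 'rb') as input_file:
        raw = input_file.read()
    return raw.decode(encoding)


def to_output_bytes(text, encoding='utf-8'):
    """Encode explicitly instead of relying on the console codec."""
    return text.encode(encoding)


# Demo: a temporary file (an assumption, standing in for the real
# input) that holds the single character U+76D8 encoded as UTF-8.
fd, demo_path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
with io.open(demo_path, 'wb') as demo_file:
    demo_file.write(u'\u76d8'.encode('utf-8'))

text = read_text(demo_path)   # step 1: bytes -> Unicode
data = to_output_bytes(text)  # step 2: Unicode -> bytes
os.remove(demo_path)

# repr() is ASCII-only, so this line is safe on any console.
print(repr(data))
```

Printing `repr(data)` rather than the text itself keeps the demo independent of the console encoding; in real use you would write `data` to the output instead.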
Upvotes: 1
Reputation: 44354
The problem is the terminal you are using to display the character. In IDLE on Windows 7 it works fine:
>>> val = u'\u76d8'
>>> print val
盘
but if I use cmd.exe then I get your error.
Use a terminal that supports Unicode output.
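To see whether a given console is the problem, you can ask which encoding Python thinks stdout uses and test whether the character fits into it. This is a rough sketch, not a definitive diagnostic: `can_encode` is a hypothetical helper, and falling back to ASCII when no encoding is reported is an assumption.

```python
import sys


def can_encode(text, encoding):
    """Return True if every character of text exists in the encoding."""
    try:
        text.encode(encoding)
        return True
    except (UnicodeEncodeError, LookupError):
        return False


sample = u'\u76d8'

# sys.stdout.encoding can be None when output is redirected or piped;
# the ASCII fallback here is an assumption made for this sketch.
console_encoding = getattr(sys.stdout, 'encoding', None) or 'ascii'

for name in (console_encoding, 'cp950', 'utf-8'):
    print('%s: %s' % (name, can_encode(sample, name)))
```

If the line for your console encoding says False, printing the Unicode string directly will raise the UnicodeEncodeError from the question.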
Upvotes: 7
Reputation: 2544
Just try:
f=open(userinput,'r')
print f.read().decode('gb18030').encode('u8')
Upvotes: -1
Reputation: 22905
Code page 936 is the only one that has the character U+76D8 (which encodes to 0xC5CC). You need to use gbk or cp936.
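If you are not sure whether the file is UTF-8 or GBK, one option is to try the candidates in order. This is only a sketch under that assumption: `decode_with_fallback` is a hypothetical helper, and the two bytes below are the GBK encoding of U+76D8 mentioned above.

```python
def decode_with_fallback(raw, candidates=('utf-8', 'gbk')):
    """Decode raw bytes with the first candidate encoding that fits.

    UTF-8 is tried first because it rejects most non-UTF-8 input,
    while GBK accepts many byte sequences without complaint.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError('none of the candidate encodings fit')


gbk_bytes = b'\xc5\xcc'  # U+76D8 encoded as GBK / cp936
text, used = decode_with_fallback(gbk_bytes)

# repr() keeps the output ASCII-only, so it is safe on any console.
print('%s %s' % (used, repr(text)))
```

A successful decode does not prove the guess was right, so check the result by eye before relying on it.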
Upvotes: 0