Reputation: 55
I enrolled in a Chinese Studies course some time ago, and I thought it'd be a great exercise to write a flashcard program in Python. I'm storing the flashcard lists as a dictionary in a .txt file, so far without trouble. The real problems kick in when I try to load the file, encoded in UTF-8, into my program. An excerpt of my code:
import codecs
f = codecs.open(('list.txt'),'r','utf-8')
quiz_list = eval(f.read())
quizy = str(quiz_list).encode('utf-8')
print quizy
Now, if for example list.txt consists of:
{'character1':'男人'}
what is printed is actually
{'character1': '\xe7\x94\xb7\xe4\xba\xba'}
Obviously there are some serious encoding issues here, but I cannot for the life of me understand where they occur. I am working in a terminal that supports UTF-8 (not the standard cmd.exe), so that is not the problem. Reading a normal list.txt without the curly dict-bits returns the Chinese characters without a problem, so my guess is I'm not handling the dictionary part correctly. Any thoughts would be greatly appreciated!
Upvotes: 3
Views: 351
Reputation: 43061
There's nothing wrong with your encoding... Look at this:
>>> d = {1:'男人'}
>>> d[1]
'\xe7\x94\xb7\xe4\xba\xba'
>>> print d[1]
男人
Printing a string is one thing; printing its representation (what repr() returns, with non-ASCII bytes escaped) is another.
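To make the difference concrete, here is a small sketch. It is written so that it also runs on Python 3, which is why the byte string from the session above is spelled with a b prefix; the variable names are mine, not the asker's.

```python
# -*- coding: utf-8 -*-
# UTF-8 bytes for the two characters, copied from the interactive session above.
raw = b'\xe7\x94\xb7\xe4\xba\xba'

# repr() gives an ASCII-safe, escaped form of the bytes ...
escaped = repr(raw)

# ... while decoding the bytes as UTF-8 gives back the actual text.
text = raw.decode('utf-8')

print(escaped)  # the escaped form, similar to what the question shows
print(text)     # the characters themselves, on a UTF-8 capable terminal
```

The data is identical in both cases; only the way it is displayed differs.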
Upvotes: 3
Reputation: 204926
str(quiz_list)
calls repr(quiz_list['character1'])
which produces an ASCII-escaped representation of the string value. If you just print quiz_list['character1']
you'll see the characters themselves: the string already holds the correct UTF-8 bytes.
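As a rough illustration (not the asker's actual program), the sketch below builds a dictionary of the same shape and formats its entries by hand instead of relying on str() of the whole dict. The b prefix and the explicit decode are my additions so that the snippet behaves the same on Python 3.

```python
# -*- coding: utf-8 -*-
# Same shape as the dictionary in the question, with the value as UTF-8 bytes.
quiz_list = {'character1': b'\xe7\x94\xb7\xe4\xba\xba'}

# str() on a dict formats every key and value with repr(), hence the escapes.
as_text = str(quiz_list)

# Formatting the entries yourself avoids repr() on the values.
lines = []
for key, value in quiz_list.items():
    lines.append(u'%s: %s' % (key, value.decode('utf-8')))

print(as_text)            # escaped bytes, as in the question
print(u'\n'.join(lines))  # readable characters on a UTF-8 capable terminal
```

For a flashcard program this usually means looping over the dictionary and printing each card, rather than printing the dictionary object itself.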
Upvotes: 2