Reputation: 2726
I am planning to make a little Python game that randomly prints keys (English) out of a dictionary, and the user has to input the value (in German). If the value is correct, it prints 'correct' and continues. If the value is wrong, it prints 'wrong' and breaks.
I thought this would be an easy task but I got stuck on the way. My problem is I do not know how to print the German characters. Let's say I have a file 'dictionary.txt' with this text:
cat:Katze
dog:Hund
exercise:Übung
solve:lösen
door:Tür
cheese:Käse
And I have this code just to test what the output looks like:
# -*- coding: UTF-8 -*-
words = {}  # empty dictionary
with open('dictionary.txt') as my_file:
    for line in my_file.readlines():
        if len(line.strip()) > 0:  # ignoring blank lines
            elem = line.split(':')  # split on ":"
            words[elem[0]] = elem[1].strip()  # appending elements to dictionary
print words
Obviously the result of the print is not as expected:
{'cheese': 'K\xc3\xa4se', 'door': 'T\xc3\xbcr',
'dog': 'Hund', 'cat': 'Katze', 'solve': 'l\xc3\xb6sen',
'exercise': '\xc3\x9cbung'}
So where do I add the encoding and how do I do it?
Thank you!
Upvotes: 2
Views: 29166
Reputation: 1
def game(guess, answer):
    # Compare the user's guess with the expected answer and return a message.
    # (The parameter is named guess so it does not shadow the built-in input.)
    if guess == answer:
        return "You got it!"
    return "sorry, wrong answer"
Upvotes: -3
Reputation: 1123970
You are looking at byte string values, printed as repr() results because they are contained in a dictionary. String representations can be re-used as Python string literals, and non-printable and non-ASCII characters are shown using string escape sequences. Container values are always represented with repr() to ease debugging.
Thus, the string 'K\xc3\xa4se' contains two non-ASCII bytes with hex values C3 and A4, which together form the UTF-8 encoding of the U+00E4 code point (ä).
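To make the byte-versus-character distinction concrete, here is a small sketch. The names raw and text are only illustrative, and the snippet is written so that it should run on both Python 2 and Python 3:

```python
# -*- coding: utf-8 -*-
# The bytes exactly as they were read from the UTF-8 encoded file.
raw = b'K\xc3\xa4se'

# Decoding turns the bytes into a text (unicode) string.
text = raw.decode('utf-8')

# Five bytes, because the ä takes two bytes (C3 A4) in UTF-8 ...
byte_count = len(raw)
# ... but only four characters once decoded.
char_count = len(text)

# The second character is the single code point U+00E4.
second_char = text[1]

# Encoding the text again gives back the original bytes.
round_trip = text.encode('utf-8')
```

Here byte_count is 5 and char_count is 4, which is why comparing or slicing the undecoded bytes can give surprising results.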
You should decode the values to unicode objects:
with open('dictionary.txt') as my_file:
    for line in my_file:  # just loop over the file
        if line.strip():  # ignoring blank lines
            key, value = line.decode('utf8').strip().split(':')
            words[key] = value
Or, better still, use codecs.open() to decode the file as you read it:
import codecs
with codecs.open('dictionary.txt', 'r', 'utf8') as my_file:
    for line in my_file:
        if line.strip():  # ignoring blank lines
            key, value = line.strip().split(':')
            words[key] = value
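As a variation on the same idea, io.open() (available since Python 2.6, and the same function as the built-in open() in Python 3) also decodes while reading. The following is only a sketch: load_words is a hypothetical helper name, and it assumes every non-blank line contains a ':' separator.

```python
# -*- coding: utf-8 -*-
import io


def load_words(path):
    """Read 'english:german' lines from a UTF-8 file into a dict of text strings.

    Hypothetical helper; assumes each non-blank line contains a ':'.
    """
    words = {}
    with io.open(path, encoding='utf-8') as my_file:
        for line in my_file:
            line = line.strip()
            if not line:  # skip blank lines
                continue
            # Split on the first ':' only, so the value may itself contain ':'.
            key, value = line.split(u':', 1)
            words[key] = value
    return words
```

Calling load_words('dictionary.txt') would then return the same kind of dictionary as the loops above, with both keys and values already decoded.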
Printing the resulting dictionary will still use repr() results for the contents, so now you'll see u'cheese': u'K\xe4se' instead, because \xe4 is the escape code for the Unicode code point U+00E4, the ä character. Print individual words if you want the actual characters to be written to the terminal:
print words['cheese']
Now you can also compare these values with other data that you decoded (provided you know its correct encoding), manipulate them, and encode them again to whatever target codec you need. print will do this encoding automatically, for example, when printing unicode values to your terminal.
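For the game itself, the comparison step might look something like the sketch below. check_answer is a hypothetical helper, and the default 'utf-8' is an assumption about the terminal's encoding (on Python 2, sys.stdin.encoding is usually the value to pass in). It decodes the answer only if it arrives as bytes, so it should behave the same on Python 2 and Python 3:

```python
# -*- coding: utf-8 -*-
def check_answer(words, key, answer, encoding='utf-8'):
    """Return True if the user's answer matches the decoded value for key.

    Hypothetical helper; encoding is assumed to be the encoding the
    answer was typed in (typically the terminal's encoding).
    """
    if isinstance(answer, bytes):
        # Raw bytes (e.g. from raw_input() on Python 2): decode first.
        answer = answer.decode(encoding)
    # Compare text with text; strip the trailing newline or stray spaces.
    return answer.strip() == words[key]
```

The point of the design is that both sides of the comparison are decoded text, so the same word typed in a Latin-1 terminal and in a UTF-8 terminal compares equal as long as the right encoding is passed.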
You may want to read up on Unicode and Python:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
Pragmatic Unicode by Ned Batchelder
Upvotes: 6