How to encode/decode this file in Python?

Question

I am planning to make a little Python game that will randomly print keys (English) out of a dictionary, and the user has to input the value (in German). If the value is correct, it prints 'correct' and continues. If the value is wrong, it prints 'wrong' and breaks.

I thought this would be an easy task but I got stuck on the way. My problem is I do not know how to print the German characters. Let's say I have a file 'dictionary.txt' with this text:

cat:Katze
dog:Hund
exercise:Übung
solve:lösen
door:Tür
cheese:Käse

And I have this code just to test what the output looks like:

# -*- coding: UTF-8 -*-
words = {} # empty dictionary
with open('dictionary.txt') as my_file:
  for line in my_file.readlines():
    if len(line.strip())>0: # ignoring blank lines
      elem = line.split(':') # split on ":"
      words[elem[0]] = elem[1].strip() # appending elements to dictionary
print words

Obviously the result of the print is not as expected:

    {'cheese': 'K\xc3\xa4se', 'door': 'T\xc3\xbcr',
     'dog': 'Hund', 'cat': 'Katze', 'solve': 'l\xc3\xb6sen',
     'exercise': '\xc3\x9cbung'}

So where do I add the encoding and how do I do it?

Thank you!

Martijn Pieters · Accepted Answer

You are looking at byte string values, printed as repr() results because they are contained in a dictionary. Such representations can be reused as Python string literals, so non-printable and non-ASCII characters are shown as string escape sequences. Container values are always represented with repr() to ease debugging.

Thus, the string 'K\xc3\xa4se' contains two non-ASCII bytes with hex values C3 and A4, which together form the UTF-8 encoding of the U+00E4 code point (ä).
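To see this concretely, here is a small sketch that decodes those bytes and inspects the result. It is written so that it should run on both Python 2 and Python 3; the variable names are only for illustration.

```python
raw = b'K\xc3\xa4se'             # the bytes as they were read from the file
text = raw.decode('utf-8')       # decode the UTF-8 bytes to a unicode string

print(len(raw))                  # 5: the a-umlaut occupies two bytes
print(len(text))                 # 4: after decoding it is a single character
print(text == u'K\xe4se')        # True: \xc3\xa4 became code point U+00E4
print(text.encode('utf-8') == raw)  # True: encoding again round-trips
```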

You should decode the values to unicode objects:

words = {}
with open('dictionary.txt') as my_file:
    for line in my_file:   # just loop over the file
        if line.strip(): # ignoring blank lines
            key, value = line.decode('utf8').strip().split(':')
            words[key] = value

or better still, use codecs.open() to decode the file as you read it:

import codecs

words = {}
with codecs.open('dictionary.txt', 'r', 'utf8') as my_file:
    for line in my_file:
        if line.strip(): # ignoring blank lines
            key, value = line.strip().split(':')
            words[key] = value
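As an aside, io.open() also accepts an encoding argument and is available on both Python 2.6+ and Python 3 (where the built-in open() is the same function). The following is a minimal sketch of that alternative, not part of the original answer. The helper name load_words and the temporary demo file are assumptions made so the example is self-contained.

```python
import io
import os
import tempfile

def load_words(path, encoding='utf-8'):
    """Read 'key:value' lines from path into a dict of unicode strings."""
    words = {}
    with io.open(path, 'r', encoding=encoding) as my_file:
        for line in my_file:
            line = line.strip()
            if not line:                      # skip blank lines
                continue
            key, value = line.split(u':', 1)  # split on the first ':' only
            words[key] = value
    return words

# Demo with a throwaway file so the sketch does not need dictionary.txt.
sample = u'cat:Katze\n\nexercise:\xdcbung\ncheese:K\xe4se\n'
fd, demo_path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
with io.open(demo_path, 'w', encoding='utf-8') as f:
    f.write(sample)

demo_words = load_words(demo_path)
os.remove(demo_path)
print(demo_words[u'cheese'] == u'K\xe4se')    # True
```

Splitting with a maximum of one split keeps any further colons inside the value instead of raising an unpacking error.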

Printing the resulting dictionary will still use repr() results for the contents, so now you'll see u'cheese': u'K\xe4se' instead, because \xe4 is the escape sequence for Unicode code point U+00E4, the ä character. Print individual words if you want the actual characters to be written to the terminal:

print words['cheese']

But now you can compare these values with other data that you have decoded (provided you know its correct encoding), manipulate them, and encode them again to whatever target codec you need. print does this encoding automatically, for example, when printing unicode values to your terminal.
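For the game described in the question, the comparison step might look roughly like the sketch below. The function name check_answer is hypothetical, and the sketch assumes the user's input arrives UTF-8 encoded when it is a byte string (as raw_input() would return on a UTF-8 terminal in Python 2); on Python 3, input() already returns text and the decode step is skipped.

```python
def check_answer(words, key, answer, encoding='utf-8'):
    """Return True if answer matches the decoded value stored for key."""
    if isinstance(answer, bytes):          # byte string: decode it first
        answer = answer.decode(encoding)   # assumed encoding, see lead-in
    return answer.strip() == words[key]

words = {u'cheese': u'K\xe4se', u'door': u'T\xfcr'}

print(check_answer(words, u'cheese', b'K\xc3\xa4se\n'))  # True
print(check_answer(words, u'door', u'Tur'))              # False
```

Decoding at the boundary and comparing only unicode values avoids comparing raw bytes against decoded text.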

You may want to read up on Unicode and Python:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
The Python Unicode HOWTO
Pragmatic Unicode by Ned Batchelder
