Patric Hartmann

Reputation: 748

PyGame: Proper use of Unicode

My goal is to create a program with which the user can learn Bible verses by being shown a problem and solving it through input (e.g. "Quote verse Gen 3:15"). As the Bible translation I have to work with is German, it contains a ton of umlauts, which never display properly.

My PyGame file's header:

#!/usr/bin/python
# -*- coding: utf-8 -*-

Later on, I list the three German umlauts:

u'ö'.encode('utf-8')
u'ä'.encode('utf-8')
u'ü'.encode('utf-8')

The txt-file is parsed by this function:

def load_list(listname):
    fullname = os.path.join("daten", listname + ".txt")
    with codecs.open(fullname, "r", "utf-8-sig") as name:
        lines = name.readlines()
    for x in range(0, len(lines)):
        lines[x] = lines[x].strip("\n")
        lines[x] = lines[x].strip("\r")
    print lines

I'm aware that I could combine the two strip calls into one line, but that's not the topic here.

How can I get PyGame to display the umlauts from the text file correctly, and also display the umlauts in the user's input correctly? I have checked hundreds of suggestions, but I can't get anything to really work here.

Any help is highly appreciated, before I lose my sanity (well, as I'm sitting here coding games, I probably already have anyway :D )

Upvotes: 1

Views: 1130

Answers (1)

lenz

Reputation: 5818

I'll try to summarize:

  • Printing anything other than a str or unicode object triggers that object's __repr__() method. For a sequence such as a list, this applies to the contained elements as well, so every non-ASCII character is escaped in \xXX (or \uXXXX) notation. Note the difference between print 'text' and print ['text']: in the latter case, the string's quotes are printed as well (besides the brackets, of course). Use str.join() to concatenate a list of strings, so that you control how the output looks.
  • It's a good idea to always explicitly decode input (as you do by using codecs) and encode output (which is not done in the code snippets in your question).
  • The source file encoding (the # coding: utf8 line in the header) has nothing to do with the encoding of input and output. It only enables you to type non-ASCII characters in string literals (= characters inside quotes in the source file), instead of using \xXX escapes.
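
To make the three points above concrete, here is a minimal sketch rather than a drop-in fix for your program. The names load_lines, join_lines, write_encoded and UMLAUTS are made up for this illustration, and I'm assuming the text file is UTF-8 (with or without a BOM), as the "utf-8-sig" in your question suggests. It uses io.open and u'' literals so that it should behave the same on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
import io

# The coding line above only allows typing the umlauts directly in this
# literal; the same string could be written as u"\xe4\xf6\xfc".
UMLAUTS = u"äöü"  # hypothetical constant, for illustration only


def load_lines(path):
    """Decode explicitly on input and return a list of unicode lines."""
    # "utf-8-sig" decodes UTF-8 and drops a leading BOM if one is present.
    with io.open(path, "r", encoding="utf-8-sig") as handle:
        return [line.rstrip(u"\r\n") for line in handle]


def join_lines(lines):
    """Build one unicode string instead of printing the list object."""
    # Printing the list itself would show brackets, quotes and, on
    # Python 2, escaped non-ASCII characters.
    return u"\n".join(lines)


def write_encoded(text, binary_stream, encoding="utf-8"):
    """Encode explicitly on output and write the bytes to a binary stream."""
    binary_stream.write(text.encode(encoding))
```

A possible use would be write_encoded(join_lines(load_lines(path)), stream), where stream is any object that accepts bytes: on Python 3, sys.stdout.buffer could serve; on Python 2, sys.stdout itself accepts encoded byte strings. Which output encoding is right depends on where the text ends up, so "utf-8" is only a default here.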

Hope that makes some things clearer. There's a lot that can go wrong that looks like an encoding error, and it's not always easy to find out what's actually happening.

Upvotes: 2
