Reputation: 29465
I have a Python dictionary that contains items with non-English characters. When I print the dictionary, the Python shell does not display the non-English characters properly. How can I fix this?
Upvotes: 5
Views: 17231
Reputation: 387
In the Python 2 terminal,
>>> "heißen"
is equivalent to
>>> print repr("heißen")
The Python 2 documentation on repr (http://docs.python.org/2/library/functions.html#func-repr) is sparse.
As can be seen, both give you a byte-based representation of the byte string "heißen", in which every byte greater than 127 is \x-escaped. This is where the following comes from:
'hei\xc3\x9fen'
unicode's repr() is not much more helpful. It correctly shows 'ß' as a single Unicode character, '\xdf', but the result is still unreadable.
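A minimal sketch of the two representations described above. It is written for Python 3 (an assumption, since the snippets in this answer are Python 2), where the byte string has to be produced explicitly with encode():

```python
# Python 3 sketch: the same word as text (code points) and as UTF-8 bytes.
text = "heißen"               # str: a sequence of Unicode code points
raw = text.encode("utf-8")    # bytes: what a Python 2 str literal held in a UTF-8 source file

print(len(text), len(raw))    # 6 7  ('ß' takes two bytes in UTF-8)
print(repr(raw))              # b'hei\xc3\x9fen'
print(hex(ord("ß")))          # 0xdf, the single code point shown by unicode's repr()
```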
The practical solution I found is to use Python 3:
http://docs.python.org/3/library/functions.html#repr
The page also says:
ascii(object)
As repr(), return a string containing a printable representation of an
object, but escape the non-ASCII characters in the string returned by
repr() using \x, \u or \U escapes. This generates a string similar to
that returned by repr() in Python 2.
which explains things a little bit.
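To make the quoted difference concrete, here is a small Python 3 sketch. The dictionary contents are only an illustration; repr() keeps printable non-ASCII characters, while ascii() escapes them the way Python 2 did:

```python
# Python 3 sketch: repr() keeps printable non-ASCII characters, ascii() escapes them.
mydict = {"heiße": "heiße", "äää": "ööö"}

readable = repr(mydict)   # the text that print(mydict) shows in Python 3
escaped = ascii(mydict)   # similar to the Python 2 output for unicode objects

print(escaped)            # {'hei\xdfe': 'hei\xdfe', '\xe4\xe4\xe4': '\xf6\xf6\xf6'}
# print(readable) shows the real characters, provided the terminal encoding can represent them.
```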
Upvotes: 1
Reputation: 30531
When your application prints hei\xdfen
instead of heißen
, it means you are not printing the Unicode string itself but the string representation of the unicode object.
Let us assume your string ("heißen") is stored into variable called text
. Just to make sure where you are at, check out the type of this variable by calling:
>>> type(text)
If you get <type 'unicode'>
, it means you are dealing not with a byte string (str) but with a unicode
object.
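If you do the intuitive thing and try to print text by invoking print(text)
, you may not get the actual text ("heißen"): when Python 2 cannot determine the terminal's encoding it falls back to ASCII and raises a UnicodeEncodeError, and when the object sits inside a container such as a dict you see the string representation of the unicode object instead.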
To fix this, you need to know which encoding your terminal has and print out your unicode object encoded according to the given encoding.
For instance, if your terminal uses UTF-8 encoding, you can print out the string by invoking:
print text.encode('utf-8')
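If you would rather not hard-code the encoding, something along these lines may work. It is a Python 3 sketch, and encode_for_terminal is a hypothetical helper name, not part of any library; it asks the output stream for its encoding and falls back to the locale's preferred one:

```python
import locale
import sys

def encode_for_terminal(text, encoding=None):
    """Encode text for the terminal; characters it cannot represent become '?'."""
    # Hypothetical helper: prefer an explicit encoding, then the stream's, then the locale's.
    target = encoding or getattr(sys.stdout, "encoding", None) or locale.getpreferredencoding(False)
    return text.encode(target, errors="replace")

print(encode_for_terminal("heißen", "utf-8"))   # b'hei\xc3\x9fen'
print(encode_for_terminal("heißen", "ascii"))   # b'hei?en'
```

The errors="replace" argument is a design choice for this sketch: it trades a UnicodeEncodeError for a visible '?' when the terminal cannot show a character.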
So much for the basic concepts. Now let me give you a more detailed example. Let us assume we have a source code file storing your dictionary, like:
mydict = {'heiße': 'heiße', 'äää': 'ööö'}
When you type print mydict
you will get {'\xc3\xa4\xc3\xa4\xc3\xa4': '\xc3\xb6\xc3\xb6\xc3\xb6', 'hei\xc3\x9fe': 'hei\xc3\x9fe'}
. Even print mydict['äää']
doesn't work: it results in something like ├Â├Â├Â
. The nature of the problem is revealed by trying out print type(mydict['äää'])
which will tell you that you are dealing with a byte string (str)
object.
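The garbled output can be reproduced deliberately. The Python 3 sketch below assumes the terminal uses code page 850 (my assumption, chosen because it yields exactly the characters shown above); it decodes the UTF-8 bytes with the wrong code page and then undoes the damage:

```python
# Python 3 sketch: UTF-8 bytes interpreted with a different code page (mojibake).
utf8_bytes = "ööö".encode("utf-8")        # b'\xc3\xb6\xc3\xb6\xc3\xb6'
mojibake = utf8_bytes.decode("cp850")     # cp850 is an assumed terminal code page

# Reversing the wrong decode recovers the original text.
repaired = mojibake.encode("cp850").decode("utf-8")

print(ascii(mojibake))                    # '\u251c\xc2\u251c\xc2\u251c\xc2'
print(repaired == "ööö")                  # True
```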
In order to fix the problem, you first need to decode the byte string from your source code file's charset into a unicode object and then encode it in the charset of your terminal. For individual dict items this can be achieved by:
print unicode(mydict['äää'], 'utf-8')
Note that if the default encoding doesn't apply to your terminal, you need to write:
print unicode(mydict['äää'], 'utf-8').encode('utf-8')
where the outer encode method specifies the encoding of your terminal.
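The same decoding step can be applied to every key and value at once. The following is a Python 3 sketch in which bytes objects stand in for Python 2 byte strings; decode_dict is a hypothetical helper name, and UTF-8 as the source encoding is an assumption carried over from the example:

```python
def decode_dict(raw, source_encoding="utf-8"):
    """Return a copy of raw with every bytes key and value decoded to text."""
    def to_text(item):
        # Leave anything that is not a byte string untouched.
        return item.decode(source_encoding) if isinstance(item, bytes) else item
    return {to_text(key): to_text(value) for key, value in raw.items()}

# Stand-in for the Python 2 dict read from a UTF-8 source file.
raw_dict = {
    "heiße".encode("utf-8"): "heiße".encode("utf-8"),
    "äää".encode("utf-8"): "ööö".encode("utf-8"),
}
decoded = decode_dict(raw_dict)
print(decoded == {"heiße": "heiße", "äää": "ööö"})   # True
```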
I really really urge you to read through Joel's "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)". Unless you understand how character sets work, you will stumble across problems similar to this again and again.
Upvotes: 7
Reputation: 10467
Python 3.0 has Unicode strings by default; in Python 2.x you have to prefix the string with u:
u"汉字/漢字 chinese"
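As a quick check, assuming Python 3.3 or later (where the u prefix is accepted again for compatibility), the prefixed and unprefixed literals are the same kind of object:

```python
# Python 3.3+ sketch: the u prefix is allowed but changes nothing.
prefixed = u"汉字/漢字 chinese"
plain = "汉字/漢字 chinese"

print(prefixed == plain, type(prefixed).__name__)    # True str
print(len(prefixed), len(prefixed.encode("utf-8")))  # 13 21 (code points vs. UTF-8 bytes)
```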
Upvotes: 1
Reputation: 29985
Actually, that's not really a Python-related issue.
Your environment variables (I'm assuming that you're on either Linux or Mac) should have the UTF-8 character encoding active.
You should be able to put these in your ~/.profile (or ~/.bashrc) file:
export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8
export LANGUAGE=en_US.UTF-8
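To see whether these settings actually reach Python, a sketch like the following may help. It is Python 3, and encoding_report is a hypothetical helper name; it only reads the three variables above plus what the interpreter reports:

```python
import locale
import os
import sys

def encoding_report(environ=None):
    """Collect the locale-related settings that influence how Python prints text."""
    env = os.environ if environ is None else environ
    return {
        "LC_ALL": env.get("LC_ALL"),
        "LANG": env.get("LANG"),
        "LANGUAGE": env.get("LANGUAGE"),
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
        "preferred_encoding": locale.getpreferredencoding(False),
    }

for name, value in encoding_report().items():
    print(name, "=", value)
```

If the encodings reported are not UTF-8 after exporting the variables, the shell session probably has not re-read the profile file yet.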
-edit-
Actually, Mac uses UTF-8 by default. This is a Windows/Linux issue.
-edit 2-
You should, of course, always use unicode strings, a unicode editor and a unicode doctype. But I'm assuming that you know that :-)
Upvotes: 4