Python UTF-8 Latin-1 displays wrong character

Question

I'm writing a very small script that converts Latin-1 characters into Unicode (I'm a complete beginner in Python).

I tried a method like this:

def latin1_to_unicode(character):
    uni = character.decode('latin-1').encode("utf-8")
    return uni

It works fine for characters that are not specific to the latin-1 set, but if I try the following example:

print latin1_to_unicode('å')

It returns Ã¥ instead of å. Same goes for other letters like æ and ø.

Can anyone please explain why this is happening? Thanks

I have the # -*- coding: utf8 -*- declaration in my script, in case that matters.

Martijn Pieters · Accepted Answer

Your source code is encoded as UTF-8, but you are decoding the data as Latin-1. Don't do that; you are creating Mojibake.

Decode from UTF-8 instead, and don't encode again. print writes to sys.stdout, which is configured with your terminal or console codec (detected when Python starts), so it encodes the Unicode value for you.
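
A minimal sketch of that fix, written so that it should run on both Python 2 and Python 3. The function name utf8_to_unicode is hypothetical, and the input bytes are spelled out as escapes so the data does not depend on how your editor saves the file:

```python
def utf8_to_unicode(data):
    """Decode a UTF-8 byte string into a Unicode string (hypothetical helper)."""
    return data.decode('utf-8')

raw = b'\xc3\xa5'            # the two UTF-8 bytes for U+00E5 (a with ring above)
text = utf8_to_unicode(raw)  # a single code point, U+00E5

# Two bytes in, one character out. No .encode() call afterwards:
# printing `text` lets Python encode it for the terminal.
print(len(raw), len(text))
```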

My terminal is configured for UTF-8, so when I enter the å character in my terminal, UTF-8 data is produced:

>>> 'å'
'\xc3\xa5'
>>> 'å'.decode('latin1')
u'\xc3\xa5'
>>> print 'å'.decode('latin1')
Ã¥

You can see that the character uses two bytes; when saving your Python source with an editor configured to use UTF-8, Python reads the exact same bytes from disk to put into your bytestring.

Decoding those two bytes as Latin-1 produces two Unicode code points, one per byte, because Latin-1 maps every byte to exactly one character: 0xC3 becomes Ã (U+00C3) and 0xA5 becomes ¥ (U+00A5).
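
To make that concrete, the sketch below decodes the same two bytes with both codecs and lists the resulting code points. It is only an illustration; the variable names are made up, and it is written to run on both Python 2 and Python 3:

```python
raw = b'\xc3\xa5'                 # UTF-8 encoding of U+00E5

wrong = raw.decode('latin-1')     # each byte becomes its own code point
right = raw.decode('utf-8')       # both bytes combine into one code point

print([hex(ord(c)) for c in wrong])   # ['0xc3', '0xa5']
print([hex(ord(c)) for c in right])   # ['0xe5']

# Latin-1 maps bytes to code points one-to-one, so the mistake can be
# undone by encoding back to Latin-1 and decoding as UTF-8.
repaired = wrong.encode('latin-1').decode('utf-8')
print(repaired == right)              # True
```

The round trip at the end only works because no bytes were lost along the way; it is a way to inspect the problem, not a substitute for decoding with the right codec in the first place.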

You probably want to do some studying on the difference between Unicode and encodings, and how that relates to Python:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
Pragmatic Unicode by Ned Batchelder
The Python Unicode HOWTO
