Reputation: 29912

Python encoding - Is there any explanation?

Can someone explain to me why python has this behaviour?

Let me explain.

BACKGROUND

I have a Python installation and I want to use some characters that aren't in the ASCII table, so I changed Python's default encoding. I save every string in a .py file, like this: '_MAIL_TITLE_': u'Бронирование номеров',

Now, with a method that replaces my dictionary keys, I want to insert into an html template my strings in a dynamic way.

I place into html page's header:

<head>
 <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
 ...... <!-- Some Css's --> 
</head>

Unfortunately, my html doc comes to me (after those replaces) with some wrong chars (unconverted? misconverted?)

So I opened a terminal to investigate:

 1 - Python 2.4.6 (#1, Jan 27 2012, 15:41:03)
 2 - [GCC 4.1.2 20080704 (Red Hat 4.1.2-51)] on linux2
 3 - Type "help", "copyright", "credits" or "license" for more information.
 4 - >>> import sys
 5 - >>> sys.getdefaultencoding()
 6 - 'utf-8'
 7 - >>> u'èéòç'
 8 - u'\xe8\xe9\xf2\xe7'
 9 - >>> u'èéòç'.encode('utf-8')
10 - '\xc3\xa8\xc3\xa9\xc3\xb2\xc3\xa7'
11 - >>> u'è'
12 - u'\xe8'
13 - >>> u'è'.encode()
14 - '\xc3\xa8'

QUESTION

Take a look at lines [7-10]. Isn't that weird? If my Python's default encoding is utf-8 (line 6), why does it convert the string on line 7 differently than line 9 does? Now take a look at lines [11-14] and their output.

Now I'm totally confused!

THE HINT

So I changed my terminal's input encoding (previously ISO-8859-1, now UTF-8) and something changed:

 1 - Python 2.4.6 (#1, Jan 27 2012, 15:41:03)
 2 - [GCC 4.1.2 20080704 (Red Hat 4.1.2-51)] on linux2
 3 - Type "help", "copyright", "credits" or "license" for more information.
 4 - >>> import sys
 5 - >>> sys.getdefaultencoding()
 6 - 'utf-8'
 7 - >>> u'èéòç'
 8 - u'\xc3\xc3\xa8\xc3\xa9\xc3\xb2\xc3\xa7'
 9 - >>> u'èéòç'.encode('utf-8')
10 - '\xc3\xa8\xc3\xa9\xc3\xb2\xc3\xa7'
11 - >>> u'è'
12 - u'\xe8'
13 - >>> u'è'.encode()
14 - '\xc3\xa8'

So explicit encoding works independently of the input encoding (or so it seems to me, but I've been stuck on this for days, so maybe I've confused myself).

WHERE IS THE SOLUTION??

Comparing line 8 of the background and hint sessions, you can see that the unicode objects created are different. So I started thinking about it. What have I concluded? Nothing. Nothing except that, maybe, my encoding problems come from the file's encoding when I save my .py (which contains all the utf-8 characters that have to be inserted into the html document).

THE "REAL" CODE

The code does nothing special: it opens an html template, reads it into a string, replaces the placeholders with unicode (utf-8 encoded? I hope so) strings and saves the result into another file that will be viewed from the Internet (yes, my "landing" page declares utf-8 in its header). I don't have the code here because it is scattered across several files, but I'm sure of the program's workflow (I traced it).

FINAL QUESTION

In the light of this, does anybody have any idea for making my code work? Ideas about unix file encoding? Or .py file encoding? How can I change the encoding to make my code work?

LAST HINT

If, before substituting the placeholders with the utf-8 object, I insert a

utf8Obj.encode('latin-1')

my document displays perfectly on the Internet!

Thanks to those who answer.

EDIT1 - DEVELOPMENT WORKFLOW

OK, this is my development workflow:

I use CVS for this project. The project is located on a CentOS server, which is a 64-bit machine. I develop my code on Windows 7 (64-bit) with Eclipse. Every modification is committed ONLY with CVS commit. The code is executed on the CentOS machine, which uses this Python:

Python 2.4.6 (#1, Jan 27 2012, 15:41:03)
[GCC 4.1.2 20080704 (Red Hat 4.1.2-51)] on linux2

I set up Eclipse to work this way: PREFERENCES -> GENERAL -> WORKSPACE -> TEXT FILE ENCODING : UTF-8

A Zope/Plone application runs on the same server: it serves some PHP pages. The PHP pages call some Python methods (application logic) through web services located on the Zope/Plone "server". That server interfaces directly with the application logic.

That's all

EDIT2

This is the function that does the replace:

    def _fillTemplate(self, buf):
        """_fillTemplate(buf)-->str
        Returns the document with the fields replaced using dict_template.
        """
        try:
            for k, v in self.dict_template.iteritems():
                if not isinstance(v, unicode):
                    v = str(v)
                else:
                    v = v.encode('latin-1')  # In that way it works, but why?
                buf = buf.replace(k, v)

Upvotes: 3

Answers (3)

Eric O. Lebigot

Reputation: 94485

In order to solve this and future problems, I would advise that you look at the answers to question UnicodeDecodeError when redirecting to file, which contains a general discussion of what this encoding/decoding business is about.

In the first example, your terminal encodes in Latin1:

7 - >>> u'èéòç'
8 - u'\xe8\xe9\xf2\xe7'

In Latin1, each of these characters is a single byte whose value equals the character's Unicode code point, so Python decodes your input into the correct four-character Unicode string. When you switch your terminal to UTF-8, you get

7 - >>> u'èéòç'
8 - u'\xc3\xc3\xa8\xc3\xa9\xc3\xb2\xc3\xa7'

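Your terminal sends UTF-8 encodings to Python, as four 2-byte sequences. Your Python interpreter, apparently still expecting Latin1 input, took these bytes verbatim and turned each byte into a separate character. The result is not another valid representation of your string: UTF-8 encodes each character in exactly one way, and these bytes only give back your four characters when they are decoded as UTF-8.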
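
To make the mismatch concrete, here is a minimal sketch that reproduces it outside the terminal. It assumes the interpreter decodes the terminal's bytes as Latin1, which is what the output in the question suggests; the variable names are only illustrative.

```python
# The four characters from the question, written as escapes so the
# example does not depend on the encoding of this file or terminal.
original = u'\xe8\xe9\xf2\xe7'

# What a UTF-8 terminal sends: two bytes per character, eight in total.
sent_by_terminal = original.encode('utf-8')

# What an interpreter expecting Latin1 makes of those bytes:
# every byte becomes one character, so the string has eight characters.
misread = sent_by_terminal.decode('latin-1')

# The damage is reversible as long as no byte was lost:
# go back to the raw bytes, then decode them with the right codec.
repaired = misread.encode('latin-1').decode('utf-8')
```

Here `misread` is the eight-character string `u'\xc3\xa8\xc3\xa9\xc3\xb2\xc3\xa7'`, while `repaired` is equal to `original` again.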

If your editor saves UTF-8, then you should put the following on top of your .py file:

# -*- coding: utf-8 -*-

This line must match the encoding used by your editor.
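
As an illustration, a strings module saved as UTF-8 might start like the sketch below. The dictionary name `STRINGS` is hypothetical; the key and value are the ones from the question.

```python
# -*- coding: utf-8 -*-
# The declaration above has to be on the first or second line of the file,
# and the editor must really save the file as UTF-8.

# Hypothetical container for the translated strings.
STRINGS = {
    '_MAIL_TITLE_': u'Бронирование номеров',
}

# With a matching declaration, the literal holds 20 characters,
# not the larger number of bytes used to store it on disk.
title_length = len(STRINGS['_MAIL_TITLE_'])
```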

The most robust approach to handling encodings is probably one of the following two:

Your program should only manipulate internally (byte) strings in a single encoding (UTF-8 is a good choice). This means that if you get, say, Latin-1-encoded data, you should re-encode it into UTF-8:
```
data.decode('latin1').encode('utf8')
```
The best way of handling your string literals, in this case, is to have your editor save your file in UTF-8 and use the regular (byte) string literals ("This is a string", with no u in front).
Your program can alternatively only manipulate Unicode strings. My experience is that this is a little cumbersome with Python 2. This would be my method of choice with Python 3, though, because Python 3 has much more natural support for these encoding issues (literal strings are character strings, not byte strings, etc.).
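
For the template case in the question, the second approach could look roughly like the sketch below: decode once on input, work on Unicode text, encode once on output. The function `fill_template` and its parameters are hypothetical and are not the asker's real code; it also assumes the template file is stored as UTF-8.

```python
# -*- coding: utf-8 -*-
text_type = type(u'')  # unicode on Python 2, str on Python 3


def fill_template(template_bytes, values, encoding='utf-8'):
    """Return the filled template as bytes in `encoding` (hypothetical helper)."""
    # bytes -> text, exactly once, when the data enters the program
    text = template_bytes.decode(encoding)
    for key, value in values.items():
        if not isinstance(value, text_type):
            value = text_type(value)  # e.g. numbers
        text = text.replace(key, value)
    # text -> bytes, exactly once, when the data leaves the program
    return text.encode(encoding)


template = u'<title>_MAIL_TITLE_</title>'.encode('utf-8')
page = fill_template(template, {u'_MAIL_TITLE_': u'Бронирование номеров'})
```

The returned bytes match the `charset=UTF-8` declared in the page header. Encoding the values as `'latin-1'` instead, as in the question, can only work for characters that exist in Latin1: the Cyrillic title above would raise a `UnicodeEncodeError`.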

Upvotes: 5

sth

Reputation: 229593

In Line 7 you output a Unicode object:

>>> u'èéòç'
u'\xe8\xe9\xf2\xe7'

No encoding happens; it just tells you that your input consists of the Unicode code points \xe8, \xe9, and so on.

In line 9 you create a UTF-8 encoded string from the Unicode object. Output of the encoded string looks different from the unencoded Unicode object, but why wouldn't it:

>>> u'èéòç'.encode('utf-8')
'\xc3\xa8\xc3\xa9\xc3\xb2\xc3\xa7'
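
The difference between the two outputs can be checked directly. This is only a small sketch with illustrative variable names; it lists the code points of the Unicode object next to the byte values of its UTF-8 encoding.

```python
text = u'\xe8\xe9\xf2\xe7'

# One code point per character: four values.
code_points = [hex(ord(ch)) for ch in text]

# The UTF-8 encoding needs two bytes for each of these characters: eight values.
encoded = text.encode('utf-8')
byte_values = [hex(b) for b in bytearray(encoded)]
```

`code_points` is `['0xe8', '0xe9', '0xf2', '0xe7']`, while `byte_values` is `['0xc3', '0xa8', '0xc3', '0xa9', '0xc3', '0xb2', '0xc3', '0xa7']`.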

In your second experiment, where you changed the terminal encoding, you actually broke the interpretation of input characters:

>>> u'èéòç'
u'\xc3\xa8\xc3\xa9\xc3\xb2\xc3\xa7'

When you typed those four characters into the string, the terminal encoded them as eight UTF-8 bytes, and Python then thinks you typed in eight separate characters, one per byte. But those eight characters are not the ones you wanted to type in. It looks like Python thinks it will get ISO-8859-1 characters from the terminal while it actually gets UTF-8 data, resulting in a mess.

Upvotes: 3

Bite code

Reputation: 596743

While you answer my comment, here is the answer to the first question:

Take a look to line [7-10]. Isn't weird? Why if my (line 6) python have a defaultencoding in utf-8, then convert that string (line7) in a different way that line 9 does? Now, take a look to lines [11-14] and their output..

No, it's not weird: you must distinguish between Python encoding, shell encoding, system encoding, file encoding, declared file encoding and applied encoding. That makes a lot of encodings, doesn't it?

sys.getdefaultencoding()

This will give you the encoding Python uses for the unicode implementation. This has nothing to do with output.
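
Several of these encodings can be inspected from Python itself. The helper below is a hypothetical sketch (the name `report_encodings` is made up); it only collects what the standard `sys` and `locale` modules report, and the values will differ from one machine and terminal to another.

```python
import locale
import sys


def report_encodings():
    """Collect the encodings Python reports for several distinct roles."""
    return {
        # used by the unicode implementation for implicit conversions
        'default': sys.getdefaultencoding(),
        # what Python assumes for terminal input and output;
        # may be None when the stream is not a terminal
        'stdin': getattr(sys.stdin, 'encoding', None),
        'stdout': getattr(sys.stdout, 'encoding', None),
        # used for file names
        'filesystem': sys.getfilesystemencoding(),
        # what the system locale prefers for text data
        'locale': locale.getpreferredencoding(),
    }


encodings = report_encodings()
```

If `'stdin'` or `'stdout'` differs from what the terminal really sends and displays, you get the kind of mismatch shown in the question.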

In [7]: u'è'
Out[7]: u'\xe8'
In [8]: u'è'.encode('utf8')
Out[8]: '\xc3\xa8'
In [9]: print u'è'
è
In [10]: print u'è'.encode('utf8')
è

When you use print, the character is printed to the screen; if you don't, Python gives you a representation that you can copy/paste to obtain the same data.

Since a unicode string is not the same as a utf8 string, it doesn't give you the same data.

Unicode is a "neutral" representation of the string, while utf8 is an encoded one.

Upvotes: 5
