Lost with encodings (shell and accents)

Question

I'm having trouble with encodings. I'm using version

Python 2.7.2+ (default, Oct 4 2011, 20:03:08) [GCC 4.6.1] on linux2

I have characters with accents, like é and à. My scripts use UTF-8 encoding:

#!/usr/bin/python
# -*- coding: utf-8 -*-

Users can type strings using raw_input(), wrapped in a helper that pre-fills the line:

import readline

def rlinput(prompt, prefill=''):
    readline.set_startup_hook(lambda: readline.insert_text(prefill))
    try:
        return raw_input(prompt)
    finally:
        readline.set_startup_hook()

It is called in the main loop of the 'pseudo' shell:

while to_continue:
    to_continue, feedback = action(unicode(rlinput(u'todo > '), 'utf-8'))
    os.system('clear')
    print T, u"\n" + feedback

Data are stored as pickle in files.

I managed to get the app working, but I end up with silly things like this:

Core file:

class Task():
    # ...
    def __str__(self):
        r = (u"OK" if self._done else u"A faire").ljust(8) + self.getDesc()
        return r.encode('utf-8')

and so, in the shell file:

feedback = jaune + str(t).decode('utf-8') + vert + u" supprimée"

That's where I realized that I might be totally wrong about encoding/decoding. So I tried to decode directly in rlinput, but failed. I read some posts on Stack Overflow and re-read http://docs.python.org/library/codecs.html. While waiting for my Python book, I'm lost :/

I guess there is a lot of bad code, but my question here is only about encoding issues. You can find the code here (most comments are in French, sorry; it's for personal use and I'm a beginner; you'll also need yapsy - http://yapsy.sourceforge.net/ ; then configure paths, then in py_todo: ./todo_shell.py): http://bit.ly/rzp9Jm

wberry · Accepted Answer

Standard input and output are byte-based on all Unix systems. That's why you have to call the unicode function to get character-strings for them. The decode error indicates that the bytes coming in are not valid UTF-8.
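To make that failure concrete, here is a minimal sketch of what happens when the letter é arrives encoded as ISO-8859-1 and is decoded as UTF-8. The byte values are illustrative and not captured from the asker's terminal. It is written to run unchanged on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
# Illustrative bytes: the same letter 'é' as two differently configured
# terminals would send it.
latin1_bytes = b'\xe9'      # 'é' encoded as ISO-8859-1 (one byte)
utf8_bytes = b'\xc3\xa9'    # 'é' encoded as UTF-8 (two bytes)

# UTF-8 bytes decoded as UTF-8 give a single character.
text = utf8_bytes.decode('utf-8')

# ISO-8859-1 bytes decoded as UTF-8 fail: 0xE9 announces a multi-byte
# sequence whose continuation bytes never arrive.
try:
    latin1_bytes.decode('utf-8')
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

print(repr(text))
print(decode_failed)
```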

Basically, the problem is the assumption of UTF-8 encoding, which is not guaranteed. Confirm this by changing the encoding in your unicode call to 'ISO-8859-1', or by changing the character encoding of your terminal emulator to UTF-8. (PuTTY supports this, in the "Translation" menu.)
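The first half of that experiment can also be run offline. The sketch below uses a hypothetical helper, diagnose, which is not part of the asker's code, and made-up sample bytes. It decodes the same input under both candidate encodings and records which ones succeed.

```python
# -*- coding: utf-8 -*-
def diagnose(raw):
    """Hypothetical helper: decode raw bytes under each candidate encoding.

    Returns a dict mapping encoding name to the decoded text, or to None
    when the bytes are not valid in that encoding.
    """
    results = {}
    for name in ('utf-8', 'iso-8859-1'):
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results

# The word "supprimée" as each kind of terminal would send it (sample data).
from_latin1_terminal = diagnose(b'supprim\xe9e')
from_utf8_terminal = diagnose(b'supprim\xc3\xa9e')

print(repr(from_latin1_terminal))
print(repr(from_utf8_terminal))
```

Note that decoding as ISO-8859-1 never raises, because every byte value maps to a character. A successful ISO-8859-1 decode is therefore weak evidence on its own; if it produces text like "supprimÃ©e", the input was most likely UTF-8 after all.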

If the above experiment confirms this, your challenge is to support the locale of the user and deduce the correct encoding, or perhaps to make the user declare the encoding in a command line argument or configuration. The $LANG environment variable is about the best you can do without an explicit declaration, and I find it to be a poor indicator of the desired character encoding.
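One possible way to combine these options is sketched below, assuming an explicit user setting should always win. The names guess_input_encoding and to_text are hypothetical. sys.stdin.encoding and locale.getpreferredencoding() are standard library calls, but both are derived from the same environment settings as $LANG, so they share its limitations.

```python
# -*- coding: utf-8 -*-
import locale
import sys

def guess_input_encoding(explicit=None):
    """Hypothetical helper: choose an encoding for terminal input.

    Order of preference: an explicit user choice (for example from a
    command line argument), then what Python detected for stdin, then
    the locale's preferred encoding, then UTF-8 as a last resort.
    """
    if explicit:
        return explicit
    stdin_encoding = getattr(sys.stdin, 'encoding', None)
    if stdin_encoding:
        return stdin_encoding
    return locale.getpreferredencoding() or 'utf-8'

def to_text(raw, encoding):
    """Decode bytes to text; pass already-decoded text through unchanged."""
    if isinstance(raw, bytes):
        # 'replace' is a choice made for this sketch: a wrong guess shows
        # U+FFFD instead of raising UnicodeDecodeError.
        return raw.decode(encoding, 'replace')
    return raw

print(guess_input_encoding())
print(repr(to_text(b'caf\xe9', 'iso-8859-1')))
```

In the asker's main loop, this would replace the hard-coded 'utf-8' in the unicode(...) call with the result of guess_input_encoding(), optionally fed from a configuration value.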
