7seb
7seb

Reputation: 172

Lost with encodings (shell and accents)

I'm having trouble with encodings. I'm using version

Python 2.7.2+ (default, Oct 4 2011, 20:03:08) [GCC 4.6.1] on linux2

I have chars with accents like é à. My scripts uses utf-8 encoding

#!/usr/bin/python
# -*- coding: utf-8 -*-

Users can type strings usings raw_input() with .

def rlinput(prompt, prefill=''):
    readline.set_startup_hook(lambda: readline.insert_text( prefill))
    try:
        return raw_input(prompt)
    finally:
        readline.set_startup_hook()

called in the main loop 'pseudo' shell

while to_continue : 
    to_continue, feedback = action( unicode(rlinput(u'todo > '),'utf-8') )
    os.system('clear')
    print T, u"\n" + feedback

Data are stored as pickle in files.

I managed to have the app working but finaly get stupid things like

core file :

class Task()
...
def __str__(self):
    r = (u"OK" if self._done else u"A faire").ljust(8) + self.getDesc()
    return r.encode('utf-8')

and so in shell file :

feedback = jaune + str(t).decode('utf-8') + vert + u" supprimée"

That's where i realize that i might be totaly wrong with encoding/decoding. So I tried to decode directly in rlinput but failed. I read some post in stackoverflow, re-read http://docs.python.org/library/codecs.html Waiting for my python book, i'm lost :/

I guess there is a lot of bad code but my question here is only related to encoding issus. You can find the code here : (most comments in french, sorry that's for personnal use and i'm a beginner, you'll also need yapsy - http://yapsy.sourceforge.net/ ) (then configure paths, then in py_todo : ./todo_shell.py) : http://bit.ly/rzp9Jm

Upvotes: 5

Views: 1124

Answers (2)

7seb
7seb

Reputation: 172

As @wberry suggested i checked the encodings : ok

$ file --mime-encoding todo_shell.py task.py todo.py
todo_shell.py: utf-8
task.py:       utf-8
todo.py:       utf-8
$ echo $LANG
fr_FR.UTF-8
$ python -c "import sys; print sys.stdin.encoding"
UTF-8

As @eryksun suggested as decoded the user inputs (+encode the strings submited before) (solved some problems if my memory is good) (will test deeply later) :

def rlinput(prompt, prefill=''):
readline.set_startup_hook(lambda: readline.insert_text( prefill.encode(sys.stdin.encoding) ))
try:
    return raw_input( prompt ).decode( sys.stdin.encoding )
finally:
    readline.set_startup_hook()

I still hav problems but my question was not well defined so i can't get clear answer. I feel less lost now and have directions to search. Thank you !

EDIT : i replaced the str methodes with unicode and it killed some (all?) probs.

Thanks @eryksun for the tips. (this links helped me : Python __str__ versus __unicode__ )

Upvotes: 0

wberry
wberry

Reputation: 19347

Standard input and output are byte-based on all Unix systems. That's why you have to call the unicode function to get character-strings for them. The decode error indicates that the bytes coming in are not valid UTF-8.

Basically, the problem is the assumption of UTF-8 encoding, which is not guaranteed. Confirm this by changing the encoding in your unicode call to 'ISO-8859-1', or by changing the character encoding of your terminal emulator to UTF-8. (Putty supports this, in the "Translation" menu.)

If the above experiment confirms this, your challenge is to support the locale of the user and deduce the correct encoding, or perhaps to make the user declare the encoding in a command line argument or configuration. The $LANG environment variable is about the best you can do without an explicit declaration, and I find it to be a poor indicator of the desired character encoding.

Upvotes: 2

Related Questions