Encoding issue for Python tool Unidecode on CL

Question

I need to convert Unicode files to ASCII. If a letter doesn't exist in ASCII, it should be converted to its closest ASCII representation. I'm using the Unidecode tool for this (https://pypi.python.org/pypi/Unidecode). It works fine when I use it in the interactive Python interpreter on the command line (that is, by invoking python, importing the library, and then printing the decoded word like this: print unidecode(u'äèß')).

Unfortunately, when I try to use this tool directly on the command line (by doing something like python -c "from unidecode import *; print unidecode(u'äèß')"), it only prints gibberish: A$?A"A to be exact, even though it should have printed aess, as it did in the interpreter. This is annoying and I don't know how to solve it. I thought it might be because my Terminal's encoding is not set correctly to UTF-8 or something similar. However, running locale in my Terminal printed the following output:

LANG="de_DE.UTF-8"
LC_COLLATE="de_DE.UTF-8"
LC_CTYPE="de_DE.UTF-8"
LC_MESSAGES="de_DE.UTF-8"
LC_MONETARY="de_DE.UTF-8"
LC_NUMERIC="de_DE.UTF-8"
LC_TIME="de_DE.UTF-8"
LC_ALL="de_DE.UTF-8"

Or might it be that Python has problems with stdin encoding on the command line? It gave me correct output in the Python interpreter, but not when invoked with python -c.

Do you guys have an idea?

unutbu · Accepted Answer

When you type 'äèß' in the terminal, although you see 'äèß', the terminal sees bytes. If your terminal encoding is utf-8, then it sees the bytes

In [2]: 'äèß'
Out[2]: '\xc3\xa4\xc3\xa8\xc3\x9f'

So when you type

python -c "from unidecode import *; print unidecode(u'äèß')"

at the command line, the terminal (assuming utf-8 encoding) passes those bytes on, so Python in effect sees

python -c "from unidecode import *; print unidecode(u'\xc3\xa4\xc3\xa8\xc3\x9f')"

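That's not the unicode you intended to send to Python: each of the six bytes is treated as a separate character, instead of the three characters you typed.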

In [28]: print(u'\xc3\xa4\xc3\xa8\xc3\x9f')
Ã¤Ã¨Ã
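
To make the mix-up concrete, here is a minimal sketch (the variable names are only illustrative, and it uses nothing beyond the standard library) that reproduces the six-character string by reading the UTF-8 bytes as Latin-1, then reverses the mistake. It is written so that it should run on both Python 2 and Python 3:

```python
# -*- coding: utf-8 -*-
# Illustrative sketch only; the names below are not from the question or answer.

intended = u'\xe4\xe8\xdf'               # the three code points you meant to send
utf8_bytes = intended.encode('utf-8')    # the six bytes a UTF-8 terminal produces

# Treating every byte as its own code point (i.e. decoding as Latin-1)
# yields the six-character string shown above.
misread = utf8_bytes.decode('latin-1')

# Reversing the wrong decode and applying the right one restores the text.
repaired = misread.encode('latin-1').decode('utf-8')

print(len(intended), len(misread))       # 3 characters versus 6
print(repaired == intended)
```

The round trip through Latin-1 works because Latin-1 maps each byte value 0-255 to the code point with the same number, so no information is lost by the wrong decode.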

There are a number of ways to work around the problem, perhaps in order of convenience:

Let the terminal change äèß to \xc3\xa4\xc3\xa8\xc3\x9f and then decode it as utf-8:

% python -c "from unidecode import *; print unidecode('äèß'.decode('utf_8'))"
aess

Declare an encoding as shown in Nehal J. Wani's solution:
```
% python -c "#coding: utf8
> from unidecode import *; print unidecode(u'äèß')" 
aess
```
This requires writing the command on two lines, however.

Since u'äèß' is equivalent to u'\xe4\xe8\xdf', you could avoid the problem by passing u'\xe4\xe8\xdf' instead:
```
% python -c "from unidecode import *; print unidecode(u'\xe4\xe8\xdf')"
aess
```
The problem with doing it this way (obviously) is you have to figure out the hexadecimal code point values.

Or, you could specify each character by its Unicode name:

% python -c "from unidecode import *; print unidecode(u'\N{LATIN SMALL LETTER A WITH DIAERESIS}\N{LATIN SMALL LETTER E WITH GRAVE}\N{LATIN SMALL LETTER SHARP S}')"
aess
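
If you go with one of the last two options, you don't have to look the escapes or names up by hand. A small sketch, assuming you save it as a UTF-8 encoded script (the variable names are illustrative), that prints both using only the standard library:

```python
# -*- coding: utf-8 -*-
# Illustrative sketch: derive the \x escapes and the Unicode names for a string.
import unicodedata

text = u'äèß'

# The unicode_escape codec spells out code points below 256 as \xNN escapes.
escaped = text.encode('unicode_escape').decode('ascii')
print(escaped)

# unicodedata.name() gives the name that goes inside \N{...}.
for ch in text:
    print('U+%04X  %s' % (ord(ch), unicodedata.name(ch)))

# unicodedata.lookup() goes the other way, from a name back to the character.
rebuilt = u''.join(unicodedata.lookup(unicodedata.name(ch)) for ch in text)
```

Because the script file carries its own coding declaration, the literal is decoded correctly here, which is the same idea as the two-line `#coding: utf8` workaround above.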
