Reputation: 73
I need to convert Unicode files to ASCII. If a letter doesn't exist in ASCII, it should be converted to its closest ASCII representation.
I'm using the Unidecode package for this (https://pypi.python.org/pypi/Unidecode). It works fine in the interactive Python interpreter on the command line, that is, when I start python, import the library, and print the decoded word with print unidecode(u'äèß').
Unfortunately, when I call it directly from the command line with something like python -c "from unidecode import *; print unidecode(u'äèß')", it prints only gibberish (A$?A"A to be exact), even though it should have printed aess, as it did in the interpreter. I don't know how to solve this. I thought my terminal's encoding might not be set to UTF-8, but running locale in my terminal printed the following output:
LANG="de_DE.UTF-8"
LC_COLLATE="de_DE.UTF-8"
LC_CTYPE="de_DE.UTF-8"
LC_MESSAGES="de_DE.UTF-8"
LC_MONETARY="de_DE.UTF-8"
LC_NUMERIC="de_DE.UTF-8"
LC_TIME="de_DE.UTF-8"
LC_ALL="de_DE.UTF-8"
Or could it be that Python has problems with the stdin encoding on the command line? The output is correct in the interactive interpreter, but not when I invoke python -c.
Do you have any ideas?
Upvotes: 1
Views: 306
Reputation: 880757
When you type 'äèß' in the terminal, although you see 'äèß', the terminal sees bytes. If your terminal encoding is utf-8, then it sees the bytes
In [2]: 'äèß'
Out[2]: '\xc3\xa4\xc3\xa8\xc3\x9f'
So when you type
python -c "from unidecode import *; print unidecode(u'äèß')"
at the command line, the terminal (assuming utf-8 encoding) sees
python -c "from unidecode import *; print unidecode(u'\xc3\xa4\xc3\xa8\xc3\x9f')"
That's not the unicode you intended to send to Python: the literal now holds six code points, one per byte, instead of three.
In [28]: print(u'\xc3\xa4\xc3\xa8\xc3\x9f')
Ã¤Ã¨Ã
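To reproduce the mismatch outside the shell, here is a minimal sketch. It assumes Python 3 (the commands in this thread are Python 2) and uses only the standard library; the variable names are illustrative. It encodes the text as UTF-8, as the terminal does, and then misreads every byte as one Latin-1 code point, which is in effect what happened to the u'...' literal.

```python
# Python 3 sketch. Escapes stand in for a-umlaut, e-grave and sharp s.
intended = '\xe4\xe8\xdf'

# What a UTF-8 terminal passes along: two bytes per character here.
utf8_bytes = intended.encode('utf-8')

# Misreading each byte as one code point (Latin-1) yields six characters.
misread = utf8_bytes.decode('latin-1')

# Decoding the same bytes as UTF-8 recovers the intended three characters.
recovered = utf8_bytes.decode('utf-8')

print(utf8_bytes)
print(len(intended), len(misread))
print(recovered == intended)
```

The misread string is the six-character value that unidecode received in the failing command.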
There are a number of ways to work around the problem, listed roughly in order of convenience:
Let the terminal change äèß to \xc3\xa4\xc3\xa8\xc3\x9f and then decode it as utf-8:
% python -c "from unidecode import *; print unidecode('äèß'.decode('utf_8'))"
aess
Declare an encoding as shown in Nehal J. Wani's solution:
% python -c "#coding: utf8
> from unidecode import *; print unidecode(u'äèß')"
aess
This requires writing the command on two lines, however.
Since u'äèß' is equivalent to u'\xe4\xe8\xdf', you could avoid the problem by passing u'\xe4\xe8\xdf' instead:
% python -c "from unidecode import *; print unidecode(u'\xe4\xe8\xdf')"
aess
The obvious drawback of this approach is that you have to figure out the hexadecimal code point values.
Or, you could specify the Unicode characters by name:
% python -c "from unidecode import *; print unidecode(u'\N{LATIN SMALL LETTER A WITH DIAERESIS}\N{LATIN SMALL LETTER E WITH GRAVE}\N{LATIN SMALL LETTER SHARP S}')"
aess
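For the last two workarounds, the escapes do not have to be looked up by hand. The following is a rough sketch, assuming Python 3 and the standard unicodedata module; hex_escapes and name_escapes are hypothetical helper names, not part of Unidecode.

```python
import unicodedata

def hex_escapes(text):
    # Build \xNN, \uNNNN or \UNNNNNNNN escapes, depending on the code point.
    parts = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x100:
            parts.append('\\x%02x' % cp)
        elif cp < 0x10000:
            parts.append('\\u%04x' % cp)
        else:
            parts.append('\\U%08x' % cp)
    return ''.join(parts)

def name_escapes(text):
    # unicodedata.name() raises ValueError for characters without a name,
    # such as control characters.
    return ''.join('\\N{%s}' % unicodedata.name(ch) for ch in text)

sample = '\xe4\xe8\xdf'
print(hex_escapes(sample))
print(name_escapes(sample))
```

The printed escapes are plain ASCII, so they can be pasted into a u'...' literal on the command line without the terminal re-encoding them.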
Upvotes: 0
Reputation: 16639
If you write this to a file, say tmp.py, and run it, you get an error:
#!/bin/python
from unidecode import *
print unidecode(u'äèß')
[Wani@Linux tmp]$ python tmp.py
File "tmp.py", line 1
SyntaxError: Non-ASCII character '\xc3' in file tmp.py on line 1, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
[Wani@Linux tmp]$
To fix this, declare the source encoding at the top of the file:
#!/bin/python
#coding: utf8
from unidecode import *; print unidecode(u'äèß')
[Wani@Linux tmp]$ python tmp.py
aeess
[Wani@Linux tmp]$
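So, on the command line, you need to put the declaration on the first line of the -c string, either by typing a literal newline or by generating one with echo -e: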
[Wani@Linux tmp]$ python -c "#coding: utf8
from unidecode import *; print unidecode(u'äèß')"
aeess
[Wani@Linux tmp]$ python -c "$(echo -e "#coding: utf8\nfrom unidecode import *; print unidecode(u'äèß')")"
aeess
[Wani@Linux tmp]$
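The effect of the coding line can also be checked in isolation. This is a minimal sketch, assuming Python 3, where compile() accepts source as bytes and honors a PEP 263 coding comment; run_with_declared_encoding is a hypothetical helper. Declaring latin-1 for bytes that are really UTF-8 reproduces the kind of six-character string behind the gibberish reported in the question.

```python
# Source bytes as a UTF-8 terminal would deliver them (escapes stand in for
# a-umlaut, e-grave and sharp s).
body = "text = '\xe4\xe8\xdf'\n".encode('utf-8')

def run_with_declared_encoding(encoding):
    # Prepend a coding declaration, execute the source, and return the
    # value the string literal ended up with.
    source = ('# coding: %s\n' % encoding).encode('ascii') + body
    namespace = {}
    exec(compile(source, '<demo>', 'exec'), namespace)
    return namespace['text']

as_utf8 = run_with_declared_encoding('utf-8')
as_latin1 = run_with_declared_encoding('latin-1')

print(len(as_utf8), len(as_latin1))
```

With the matching declaration the literal has three characters; with the mismatched one it has six.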
Further Reading: Correct way to define Python source code encoding
Upvotes: 0