Reputation: 31
I am trying to transliterate Cyrillic to Latin from an Excel file. I am working from the bottom up and cannot figure out why this isn't working.
When I try to transliterate a simple text string, Python outputs "EEEEEE EEE" instead of the correct transliteration. How can I fix this so that I get the right result? I have been trying to figure this out all day!
symbols = (u"абвгдеёзийклмнопрстуфхъыьэАБВГДЕЁЗИЙКЛМНОПРСТУФХЪЫЬЭ",
u"abvgdeezijklmnoprstufh'y'eABVGDEEZIJKLMNOPRSTUFH'Y'E")
tr = {ord(a):ord(b) for a, b in zip(*symbols)}
text = u'Добрый Ден'
print text.translate(tr)
>>EEEEEE EEE
I appreciate the help!
Upvotes: 1
Views: 3027
Reputation: 1124110
Your source input is wrong. However you entered your source and text literals, Python did not read the right Unicode codepoints.
Instead, I strongly suspect that something like the PYTHONIOENCODING variable has been set with the error handler set to replace. This causes Python to replace every codepoint it does not recognize with a question mark, and all of your Cyrillic input is treated as unrecognized.
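One way to check this suspicion is to print the settings involved. The snippet below is only a rough diagnostic sketch: the helper name describe_io_encoding is made up for illustration, and the last part demonstrates the encode direction of the replace handler, which may not be exactly how the input was mangled in your session.

```python
import os
import sys


def describe_io_encoding():
    """Collect the settings that influence how console text is converted."""
    return {
        "PYTHONIOENCODING": os.environ.get("PYTHONIOENCODING"),
        "stdin": getattr(sys.stdin, "encoding", None),
        "stdout": getattr(sys.stdout, "encoding", None),
    }


for name, value in sorted(describe_io_encoding().items()):
    print("%s = %s" % (name, value))

# Encoding Cyrillic with a codec that cannot represent it, using the
# "replace" error handler, yields one question mark per character.
cyrillic = u"\u0414\u043e\u0431\u0440\u044b\u0439"
replaced = cyrillic.encode("ascii", "replace")
print(replaced)
```

If PYTHONIOENCODING is unset and both stream encodings report something that supports Cyrillic (UTF-8, for example), the cause is more likely the editor or console you typed the literals into.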
As a result, the only codepoint in your translation map is 63, the question mark, mapped to 69, the last character in symbols[1] (which is expected behaviour for a dictionary comprehension with only one unique key):
>>> unichr(63)
u'?'
>>> unichr(69)
u'E'
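The collapse of the map can be reproduced without any Cyrillic at all. This is a small sketch that assumes every source character was read as a question mark; garbled_source and latin_targets are stand-in names, not variables from the question.

```python
# Assumption: every Cyrillic character was read as "?" (codepoint 63).
garbled_source = u"?" * 5
latin_targets = u"abcdE"

# Each pair overwrites the previous entry because the key is always 63,
# so only the last target character survives in the table.
table = {ord(a): ord(b) for a, b in zip(garbled_source, latin_targets)}
print(table)

# Applying the one-entry table turns every question mark into "E".
print(u"?????? ???".translate(table))
```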
The same problem applies to your text unicode string; it too consists of only question marks. The translation mapping replaces each with the letter E:
>>> u'?????? ???'.translate({63: 69})
u'EEEEEE EEE'
You need to either avoid entering Cyrillic literal characters or fix your input method.
In the terminal, this depends on the codec your terminal (or Windows console) supports. Configure the correct codepage (Windows) or locale (POSIX systems) so that input and output use an encoding that supports Cyrillic (UTF-8 would be best).
In a Python source file, tell Python about the encoding used for string literals with a codec comment at the top of the file.
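For the source-file case, a minimal sketch could look like the following. It assumes the file is actually saved as UTF-8 by your editor; if the editor saves it in a different encoding, the declaration has to name that encoding instead.

```python
# -*- coding: utf-8 -*-
# The declaration above must be on the first or second line of the file,
# and it must match the encoding the file is really saved in.

symbols = (u"абвгдеёзийклмнопрстуфхъыьэАБВГДЕЁЗИЙКЛМНОПРСТУФХЪЫЬЭ",
           u"abvgdeezijklmnoprstufh'y'eABVGDEEZIJKLMNOPRSTUFH'Y'E")

# Map each Cyrillic codepoint to the codepoint of its Latin counterpart.
tr = {ord(a): ord(b) for a, b in zip(*symbols)}

text = u'Добрый Ден'
result = text.translate(tr)
print(result)
```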
Avoiding literals means using Unicode escape sequences:
symbols = (
u'\u0430\u0431\u0432\u0433\u0434\u0435\u0451\u0437\u0438\u0439\u043a\u043b\u043c'
u'\u043d\u043e\u043f\u0440\u0441\u0442\u0443\u0444\u0445\u044a\u044b\u044c\u044d'
u'\u0410\u0411\u0412\u0413\u0414\u0415\u0401\u0417\u0418\u0419\u041a\u041b\u041c'
u'\u041d\u041e\u041f\u0420\u0421\u0422\u0423\u0424\u0425\u042a\u042b\u042c\u042d',
u"abvgdeezijklmnoprstufh'y'eABVGDEEZIJKLMNOPRSTUFH'Y'E"
)
tr = {ord(a):ord(b) for a, b in zip(*symbols)}
text = u'\u0414\u043e\u0431\u0440\u044b\u0439 \u0414\u0435\u043d'
print text.translate(tr)
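The code above is Python 2. If you are on Python 3 (an assumption, since the question uses the Python 2 print statement), the same table can be built with str.maketrans. This is a sketch of an equivalent, not a change to the approach:

```python
# Python 3 sketch: str.maketrans builds the same codepoint-to-codepoint table
# from two strings of equal length.
cyrillic = (
    "\u0430\u0431\u0432\u0433\u0434\u0435\u0451\u0437\u0438\u0439\u043a\u043b\u043c"
    "\u043d\u043e\u043f\u0440\u0441\u0442\u0443\u0444\u0445\u044a\u044b\u044c\u044d"
    "\u0410\u0411\u0412\u0413\u0414\u0415\u0401\u0417\u0418\u0419\u041a\u041b\u041c"
    "\u041d\u041e\u041f\u0420\u0421\u0422\u0423\u0424\u0425\u042a\u042b\u042c\u042d"
)
latin = "abvgdeezijklmnoprstufh'y'eABVGDEEZIJKLMNOPRSTUFH'Y'E"

tr = str.maketrans(cyrillic, latin)

text = "\u0414\u043e\u0431\u0440\u044b\u0439 \u0414\u0435\u043d"
result = text.translate(tr)
print(result)
```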
Upvotes: 4