user2584342

Reputation: 31

How to transliterate Cyrillic to Latin using Python 2.7? - not correct translation output

I am trying to transliterate Cyrillic to Latin from an Excel file. I am working from the bottom up and cannot figure out why this isn't working.
When I try to translate a simple text string, Python outputs "EEEEEE EEE" instead of the correct transliteration. How can I fix this so that I get the right output? I have been trying to figure this out all day!

symbols = (u"абвгдеёзийклмнопрстуфхъыьэАБВГДЕЁЗИЙКЛМНОПРСТУФХЪЫЬЭ",
           u"abvgdeezijklmnoprstufh'y'eABVGDEEZIJKLMNOPRSTUFH'Y'E")

tr = {ord(a):ord(b) for a, b in zip(*symbols)}

text = u'Добрый Ден'
print text.translate(tr)

>>EEEEEE EEE

I appreciate the help!

Upvotes: 1

Views: 3027

Answers (1)

Martijn Pieters

Reputation: 1124110

Your source input is wrong. However you entered your source and text literals, Python did not read the right Unicode code points.

I strongly suspect that something like the PYTHONIOENCODING variable has been set with the error handler set to replace. This causes Python to replace every code point it does not recognize with a question mark, so all Cyrillic input is treated as unrecognized.
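To illustrate what a replace error handler does, here is a minimal sketch that pushes the question's text through the ascii codec with errors set to replace. The ascii codec is only a stand-in chosen for the demonstration; which codec is actually configured on the asker's system is not known.

```python
# Sketch: the effect of the 'replace' error handler on Cyrillic text.
# The 'ascii' codec is an assumption used only to trigger the replacement.
text = u'\u0414\u043e\u0431\u0440\u044b\u0439 \u0414\u0435\u043d'  # the question's text, written as escapes

# Every code point the codec cannot represent becomes '?'; the space survives.
mangled = text.encode('ascii', 'replace').decode('ascii')

print(repr(mangled))
print(sorted(set(ord(c) for c in mangled)))  # [32, 63]
```

Only two distinct code points are left afterwards: 32 (space) and 63 (question mark).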

As a result, the only codepoint in your translation map is 63, the question mark, mapped to the last character in symbols[1] (which is expected behaviour for the dictionary comprehension with only one unique key):

>>> unichr(63)
u'?'
>>> unichr(69)
u'E'

The same problem applies to your text unicode string; it too consists of only question marks. The translation mapping replaces each with the letter E:

>>> u'?????? ???'.translate({63: 69})
u'EEEEEE EEE'
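The whole failure can be reproduced without typing any Cyrillic. The sketch below assumes the first string in symbols arrived as nothing but question marks; broken_cyrillic is a hypothetical stand-in for that mangled string.

```python
# Sketch: rebuild the translation map from mangled input.
latin = u"abvgdeezijklmnoprstufh'y'eABVGDEEZIJKLMNOPRSTUFH'Y'E"
broken_cyrillic = u'?' * len(latin)  # hypothetical: what Python saw instead of Cyrillic

# Every pair has the same key (63), so each assignment overwrites the last
# and only the final value, ord(u'E') == 69, survives.
tr = {ord(a): ord(b) for a, b in zip(broken_cyrillic, latin)}
print(tr)  # {63: 69}

result = u'?????? ???'.translate(tr)
print(result)  # EEEEEE EEE
```

A quick check such as len(tr) == len(latin) would flag this kind of collapse before any text is translated.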

You need to either avoid entering Cyrillic literal characters or fix your input method.

In the terminal, this is a function of the codec your terminal (or Windows console) supports. Configure the correct code page (Windows) or locale (POSIX systems) to input and output an encoding that supports Cyrillic (UTF-8 would be best).
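To see what Python believes about the terminal, something like the following can help. It is only a diagnostic sketch: describe_io_encoding is a hypothetical helper name, and the values it reports depend entirely on the local setup.

```python
import os
import sys


def describe_io_encoding():
    """Collect the encodings Python picked for the terminal (hypothetical helper)."""
    return {
        # getattr guards against streams that are missing or lack the attribute.
        'stdin': getattr(sys.stdin, 'encoding', None),
        'stdout': getattr(sys.stdout, 'encoding', None),
        # None here means the variable is not set at all.
        'PYTHONIOENCODING': os.environ.get('PYTHONIOENCODING'),
    }


for name, value in sorted(describe_io_encoding().items()):
    print('%s: %s' % (name, value))
```

If stdin or stdout report an encoding that cannot represent Cyrillic, or PYTHONIOENCODING ends in :replace, that would be consistent with the question marks described above.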

In a Python source file, tell Python about the encoding used for string literals with a codec comment at the top of the file.
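A source file with such a declaration might look like the sketch below. It assumes the file really is saved as UTF-8 by the editor; the declaration only tells Python how to decode the bytes and does not convert anything.

```python
# -*- coding: utf-8 -*-
# The declaration above has to be on the first or second line of the file,
# and the file itself must be saved as UTF-8 (an assumption of this sketch).
symbols = (u"абвгдеёзийклмнопрстуфхъыьэАБВГДЕЁЗИЙКЛМНОПРСТУФХЪЫЬЭ",
           u"abvgdeezijklmnoprstufh'y'eABVGDEEZIJKLMNOPRSTUFH'Y'E")

# One entry per Cyrillic letter: 26 lowercase plus 26 uppercase.
tr = {ord(a): ord(b) for a, b in zip(*symbols)}

text = u'Добрый Ден'
result = text.translate(tr)
print(result)  # Dobryj Den
```

With the literals decoded correctly, the map has 52 distinct keys instead of one.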

Avoiding literals means using Unicode escape sequences:

symbols = (
    u'\u0430\u0431\u0432\u0433\u0434\u0435\u0451\u0437\u0438\u0439\u043a\u043b\u043c'
    u'\u043d\u043e\u043f\u0440\u0441\u0442\u0443\u0444\u0445\u044a\u044b\u044c\u044d'
    u'\u0410\u0411\u0412\u0413\u0414\u0415\u0401\u0417\u0418\u0419\u041a\u041b\u041c'
    u'\u041d\u041e\u041f\u0420\u0421\u0422\u0423\u0424\u0425\u042a\u042b\u042c\u042d',
    u"abvgdeezijklmnoprstufh'y'eABVGDEEZIJKLMNOPRSTUFH'Y'E"
)
tr = {ord(a):ord(b) for a, b in zip(*symbols)}

text = u'\u0414\u043e\u0431\u0440\u044b\u0439 \u0414\u0435\u043d'

print text.translate(tr)  # prints: Dobryj Den

Upvotes: 4
