Paul R

Reputation: 2797

Detect same words using different alphabets?

Python treats the words МАМА and MAMA as different strings because one is written in Latin letters and the other in Cyrillic.

How can I make Python treat them as the same string?

I only care about letters that look identical in both alphabets (homoglyphs).

Upvotes: 3

Views: 105

Answers (3)

Transliteration is not going to help (it will turn Cyrillic P into Latin R). At first glance, the Unicode compatibility forms (NFKD or NFKC) look hopeful, but they leave U+041C (CYRILLIC CAPITAL LETTER EM) unchanged instead of mapping it to U+004D (LATIN CAPITAL LETTER M), so that won't work either.
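A quick check of the normalization claim, as a minimal Python 3 sketch using only the standard unicodedata module. The variable names are illustrative only:

```python
import unicodedata

cyrillic_em = "\u041c"  # CYRILLIC CAPITAL LETTER EM
latin_m = "M"           # U+004D LATIN CAPITAL LETTER M

for form in ("NFC", "NFD", "NFKC", "NFKD"):
    normalized = unicodedata.normalize(form, cyrillic_em)
    # Report whether the code point survived unchanged and whether it became Latin M
    print(form, normalized == cyrillic_em, normalized == latin_m)
```

If the claim holds, every form reports that the Cyrillic letter is left as it is and never becomes the Latin one.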

The only solution is to build your own table of look-alike letters (homoglyphs) and translate all strings into a canonical form before comparing.
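A minimal Python 3 sketch of such a table, assuming only uppercase Cyrillic letters need to be folded onto their Latin look-alikes. The table is deliberately partial, and the names CYRILLIC_TO_LATIN, canonical and looks_same are hypothetical, chosen for illustration:

```python
# Hypothetical, partial table: Cyrillic capitals that look like Latin capitals.
# Extend it (lowercase, other scripts) to suit your data.
CYRILLIC_TO_LATIN = {
    "\u0410": "A",  # CYRILLIC CAPITAL LETTER A
    "\u0412": "B",  # CYRILLIC CAPITAL LETTER VE
    "\u0415": "E",  # CYRILLIC CAPITAL LETTER IE
    "\u041a": "K",  # CYRILLIC CAPITAL LETTER KA
    "\u041c": "M",  # CYRILLIC CAPITAL LETTER EM
    "\u041d": "H",  # CYRILLIC CAPITAL LETTER EN
    "\u041e": "O",  # CYRILLIC CAPITAL LETTER O
    "\u0420": "P",  # CYRILLIC CAPITAL LETTER ER
    "\u0421": "C",  # CYRILLIC CAPITAL LETTER ES
    "\u0422": "T",  # CYRILLIC CAPITAL LETTER TE
    "\u0425": "X",  # CYRILLIC CAPITAL LETTER HA
}
_TABLE = str.maketrans(CYRILLIC_TO_LATIN)


def canonical(text):
    """Map every known Cyrillic look-alike to its Latin counterpart."""
    return text.translate(_TABLE)


def looks_same(a, b):
    """Compare two strings after folding look-alike letters together."""
    return canonical(a) == canonical(b)


print(looks_same("\u041c\u0410\u041c\u0410", "MAMA"))
```

Unlike transliteration, this maps by shape rather than by sound, so Cyrillic ER becomes Latin P, not R.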

Note: When I said "Cyrillic P", I cheated and used the Latin look-alike, because I don't have an easy way to enter Cyrillic.

Upvotes: 2

Brendan Abel

Reputation: 37539

There is a Python library called transliterate that converts Cyrillic text to Latin:

>>> from transliterate import translit
>>> 
>>> cy = u'\u041c\u0410\u041c\u0410'
>>> en = u'MAMA'
>>> cy == en
False
>>> cy_converted = translit(cy, 'ru', reversed=True)
>>> cy_converted == en
True
>>> cy_converted
u'MAMA'

Upvotes: 3

dannyxn

Reputation: 422

You might want to use the unicodedata.normalize function: https://docs.python.org/2/library/unicodedata.html#unicodedata.normalize

Upvotes: 0
