Reputation: 2797
Python treats the words МАМА and MAMA as different strings because one is written in Cyrillic letters and the other in Latin letters.
How can I make Python treat them as the same string?
I only care about letters that look identical in both alphabets (homoglyphs).
Upvotes: 3
Views: 105
Reputation: 28997
Transliteration is not going to help (it will turn Cyrillic P into Latin R). At first glance, the Unicode compatibility forms (NFKD or NFKC) look hopeful, but they leave U+041C (CYRILLIC CAPITAL LETTER EM) as U+041C rather than mapping it to U+004D (LATIN CAPITAL LETTER M), so that won't work either.
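To see the normalization point concretely, here is a small check using only the standard library. The variable names are just for illustration; the behaviour follows from the fact that the Cyrillic capitals М and А have no decomposition mapping in Unicode.

```python
import unicodedata

cyrillic = "\u041c\u0410\u041c\u0410"  # MAMA spelled with Cyrillic letters
latin = "MAMA"                         # MAMA spelled with Latin letters

for form in ("NFC", "NFD", "NFKC", "NFKD"):
    normalized = unicodedata.normalize(form, cyrillic)
    # None of the four normalization forms changes the Cyrillic letters,
    # so the result still differs from the Latin spelling.
    print(form, normalized == cyrillic, normalized == latin)
```

Each line prints the form name followed by True False.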
The only solution is to build your own table of look-alike characters (homoglyphs) and translate all strings into a canonical form before comparing.
Note: When I wrote "Cyrillic P" above, I cheated and typed the Latin look-alike, since I don't have an easy way to enter Cyrillic.
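The table-based approach described above could be sketched roughly as follows. This is a minimal illustration, not a complete solution: the table `HOMOGLYPHS` and the helpers `canonical` and `same_look` are hypothetical names, the table only covers a handful of Cyrillic capitals, and choosing Latin as the canonical form is an arbitrary assumption. A real table would also need lowercase letters and any other scripts you care about.

```python
# Hypothetical, deliberately incomplete table:
# Cyrillic capital letter -> visually identical Latin capital letter.
HOMOGLYPHS = str.maketrans({
    "\u0410": "A",  # CYRILLIC CAPITAL LETTER A
    "\u0412": "B",  # CYRILLIC CAPITAL LETTER VE
    "\u0415": "E",  # CYRILLIC CAPITAL LETTER IE
    "\u041a": "K",  # CYRILLIC CAPITAL LETTER KA
    "\u041c": "M",  # CYRILLIC CAPITAL LETTER EM
    "\u041d": "H",  # CYRILLIC CAPITAL LETTER EN
    "\u041e": "O",  # CYRILLIC CAPITAL LETTER O
    "\u0420": "P",  # CYRILLIC CAPITAL LETTER ER
    "\u0421": "C",  # CYRILLIC CAPITAL LETTER ES
    "\u0422": "T",  # CYRILLIC CAPITAL LETTER TE
    "\u0425": "X",  # CYRILLIC CAPITAL LETTER HA
})


def canonical(text):
    """Replace every character found in the table with its Latin look-alike."""
    return text.translate(HOMOGLYPHS)


def same_look(a, b):
    """Compare two strings after mapping both to the canonical form."""
    return canonical(a) == canonical(b)


cy = "\u041c\u0410\u041c\u0410"  # MAMA in Cyrillic
en = "MAMA"                      # MAMA in Latin
print(cy == en)           # False
print(same_look(cy, en))  # True
```

Unlike transliteration, this maps Cyrillic ER (U+0420) to Latin P, because the table is keyed on appearance rather than sound.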
Upvotes: 2
Reputation: 37539
There is a Python library called transliterate that converts Cyrillic text to Latin:
>>> from transliterate import translit
>>>
>>> cy = u'\u041c\u0410\u041c\u0410'
>>> en = u'MAMA'
>>> cy == en
False
>>> cy_converted = translit(cy, 'ru', reversed=True)
>>> cy_converted == en
True
>>> cy_converted
u'MAMA'
Upvotes: 3
Reputation: 422
You might want to use the unicodedata.normalize method: https://docs.python.org/2/library/unicodedata.html#unicodedata.normalize
Upvotes: 0