Identical-looking strings with different byte representations

Question

The upper string is typed by me while the bottom string is pulled from a database.

bytes('TOYOTA', 'utf-8')
>> b'TOYOTA'

bytes('ΤΟΥΟΤΑ', 'utf-8')
>> b'\xce\xa4\xce\x9f\xce\xa5\xce\x9f\xce\xa4\xce\x91'

This causes undesirable results when I check whether the string exists:

'TOYOTA' == 'ΤΟΥΟΤΑ'
>> False

Any idea how to "fix" the incorrect string?

mkrieger1 · Accepted Answer

It appears those are Greek capital letters:

>>> import unicodedata
>>> s = 'ΤΟΥΟΤΑ'
>>> for c in s:
...     print(unicodedata.name(c))
... 
GREEK CAPITAL LETTER TAU
GREEK CAPITAL LETTER OMICRON
GREEK CAPITAL LETTER UPSILON
GREEK CAPITAL LETTER OMICRON
GREEK CAPITAL LETTER TAU
GREEK CAPITAL LETTER ALPHA
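
To check many strings for such characters rather than inspecting them by eye, something along these lines may help. It is only a sketch: the helper name `non_ascii_chars` is made up for illustration, and it relies on nothing beyond the standard `unicodedata` module.

```python
import unicodedata

def non_ascii_chars(text):
    """Return (index, character, Unicode name) for every non-ASCII character."""
    return [
        (i, ch, unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if ord(ch) > 127
    ]

typed = "TOYOTA"                                  # Latin letters
from_db = "\u03a4\u039f\u03a5\u039f\u03a4\u0391"  # Greek letters that look like TOYOTA

print(non_ascii_chars(typed))    # empty list: nothing suspicious
for entry in non_ascii_chars(from_db):
    print(entry)                 # one tuple per Greek letter
```

An empty result means the string is pure ASCII; anything else tells you which positions hold look-alike characters.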

You could use one of the available third-party libraries to transliterate the string to the Latin alphabet.

This is a similar question: How can I create a string in english letters from another language word?
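
If a third-party dependency is not an option, a hand-written mapping with the built-in `str.translate` might be enough for a case like this one. The sketch below is an assumption-laden illustration, not a general solution: the table `GREEK_TO_LATIN` and the function `to_latin_lookalike` are hypothetical names, the table covers only Greek capitals that visually resemble Latin capitals, and it maps by appearance rather than by sound (so it is not a true transliteration).

```python
# Hypothetical hand-written table: Greek capitals mapped to the Latin
# capitals they visually resemble. Extend it if your data needs more letters.
GREEK_TO_LATIN = str.maketrans({
    "\u0391": "A",  # GREEK CAPITAL LETTER ALPHA
    "\u0392": "B",  # BETA
    "\u0395": "E",  # EPSILON
    "\u0396": "Z",  # ZETA
    "\u0397": "H",  # ETA
    "\u0399": "I",  # IOTA
    "\u039a": "K",  # KAPPA
    "\u039c": "M",  # MU
    "\u039d": "N",  # NU
    "\u039f": "O",  # OMICRON
    "\u03a1": "P",  # RHO
    "\u03a4": "T",  # TAU
    "\u03a5": "Y",  # UPSILON
    "\u03a7": "X",  # CHI
})

def to_latin_lookalike(text):
    """Replace look-alike Greek capitals with Latin ones; leave the rest untouched."""
    return text.translate(GREEK_TO_LATIN)

from_db = "\u03a4\u039f\u03a5\u039f\u03a4\u0391"  # the Greek string from the question
print(to_latin_lookalike(from_db) == "TOYOTA")    # True
```

Applying the same function to both sides of a comparison keeps ordinary Latin input unchanged, since characters missing from the table pass through as they are.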
