zqtan98

Reputation: 23

Identical-looking strings with different byte representations

I typed the first string myself; the second was pulled from a database.

bytes('TOYOTA', 'utf-8')
>> b'TOYOTA'

bytes('ΤΟΥΟΤΑ', 'utf-8')
>> b'\xce\xa4\xce\x9f\xce\xa5\xce\x9f\xce\xa4\xce\x91'

This gives the wrong result when I check whether the value exists:

'TOYOTA' == 'ΤΟΥΟΤΑ'
>> False

Any idea how to "fix" the incorrect string?

Upvotes: 1

Views: 85

Answers (1)

mkrieger1

Reputation: 23117

It appears those are Greek capital letters:

>>> import unicodedata
>>> s = 'ΤΟΥΟΤΑ'
>>> for c in s:
...     print(unicodedata.name(c))
... 
GREEK CAPITAL LETTER TAU
GREEK CAPITAL LETTER OMICRON
GREEK CAPITAL LETTER UPSILON
GREEK CAPITAL LETTER OMICRON
GREEK CAPITAL LETTER TAU
GREEK CAPITAL LETTER ALPHA
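
To find which stored values are affected, the same `unicodedata.name` check can be wrapped in a small helper. This is only a sketch: `non_latin_letters` is a hypothetical name, and it assumes the values are expected to contain Latin letters only.

```python
import unicodedata

def non_latin_letters(text):
    """Return (index, char, name) for each letter whose Unicode name is not LATIN."""
    found = []
    for i, ch in enumerate(text):
        if ch.isalpha():
            # Fall back to a placeholder for characters without a Unicode name.
            name = unicodedata.name(ch, "UNKNOWN")
            if not name.startswith("LATIN"):
                found.append((i, ch, name))
    return found

# Plain ASCII input has nothing to report.
print(non_latin_letters("TOYOTA"))

# Same word, but the first letter is GREEK CAPITAL LETTER TAU (U+03A4).
print(non_latin_letters("\u03a4OYOTA"))
```

An empty list means every letter is Latin; anything else points at the exact positions that need attention.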

You could try one of the available third-party libraries to transliterate the text to the Latin alphabet.

This is a similar question: How can I create a string in english letters from another language word?
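
If pulling in a library is not an option, a rough standard-library alternative is a translation table built with `str.maketrans`. The sketch below is an assumption-laden illustration, not a real transliteration: the table name and `fix_lookalikes` are hypothetical, the mapping covers only Greek capitals that visually resemble Latin capitals, and it replaces by appearance (so Rho becomes `P`, not `R`). Note that Unicode normalization such as NFKC leaves these letters unchanged, because they are distinct characters rather than compatibility variants.

```python
# Hypothetical table: Greek capital letters mapped to the Latin capitals
# they visually resemble. Extend it if your data contains other lookalikes.
GREEK_TO_LATIN_LOOKALIKES = str.maketrans({
    "\u0391": "A",  # GREEK CAPITAL LETTER ALPHA
    "\u0392": "B",  # GREEK CAPITAL LETTER BETA
    "\u0395": "E",  # GREEK CAPITAL LETTER EPSILON
    "\u0396": "Z",  # GREEK CAPITAL LETTER ZETA
    "\u0397": "H",  # GREEK CAPITAL LETTER ETA
    "\u0399": "I",  # GREEK CAPITAL LETTER IOTA
    "\u039a": "K",  # GREEK CAPITAL LETTER KAPPA
    "\u039c": "M",  # GREEK CAPITAL LETTER MU
    "\u039d": "N",  # GREEK CAPITAL LETTER NU
    "\u039f": "O",  # GREEK CAPITAL LETTER OMICRON
    "\u03a1": "P",  # GREEK CAPITAL LETTER RHO
    "\u03a4": "T",  # GREEK CAPITAL LETTER TAU
    "\u03a5": "Y",  # GREEK CAPITAL LETTER UPSILON
    "\u03a7": "X",  # GREEK CAPITAL LETTER CHI
})

def fix_lookalikes(text):
    """Replace Greek lookalike capitals with their Latin counterparts."""
    return text.translate(GREEK_TO_LATIN_LOOKALIKES)

# The database value from the question, written with explicit escapes.
from_db = "\u03a4\u039f\u03a5\u039f\u03a4\u0391"
print(fix_lookalikes(from_db) == "TOYOTA")  # True
```

Characters not listed in the table pass through untouched, so genuinely Greek text mixed into the data would end up only partially converted; check the results before writing them back.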

Upvotes: 2
