Reputation: 171
These two strings look identical when printed, but they do not compare equal. I need to look up a dictionary item by this key, but I get a KeyError because the keys do not match. I have tried str.encode("utf-8"), str.decode("utf-8"), unicode(str, "utf-8"), and repr(); none of them helped. How can I make the strings compare equal, the way they look when printed? Thanks.
>>> str1 = u"extra\u00f1ar"
>>> str2 = u"extrañar"
>>> str1
u'extra\xf1ar'
>>> str2
u'extran\u0303ar'
>>> print str1
extrañar
>>> print str2
extrañar
>>> str1 == str2
False
Upvotes: 0
Views: 41
Reputation: 95957
You can try unicodedata.normalize, but it isn't guaranteed to work:
>>> str1 = u'extra\xf1ar'
>>> str2 = u'extran\u0303ar'
>>> str1 == str2
False
>>> print str1; print str2
extrañar
extrañar
So, observe:
>>> import unicodedata
>>> unicodedata.normalize('NFC', str1)
u'extra\xf1ar'
>>> unicodedata.normalize('NFC', str2)
u'extra\xf1ar'
>>> unicodedata.normalize('NFC', str1) == unicodedata.normalize('NFC', str2)
True
>>> print unicodedata.normalize('NFC', str1); print unicodedata.normalize('NFC', str2)
extrañar
extrañar
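For the dictionary lookup in the question, one option is to normalize keys both when the dict is built and when it is queried. This is only a minimal sketch: the helper name `nfc` and the dictionary contents are made up for illustration, and the code is written to run on both Python 2 and Python 3.

```python
import unicodedata

def nfc(text):
    # Hypothetical helper: return the NFC (composed) form of a unicode string.
    return unicodedata.normalize('NFC', text)

composed = u'extra\xf1ar'       # precomposed U+00F1
decomposed = u'extran\u0303ar'  # 'n' followed by combining tilde U+0303

# Normalize keys when building the dict (illustrative contents) ...
translations = {nfc(composed): u'to miss'}

# ... and normalize again on lookup, so either spelling finds the entry.
value = translations[nfc(decomposed)]
```

Looking up `decomposed` directly, without normalizing, would still raise KeyError, so the normalization has to be applied consistently on both sides.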
One caveat:
Even if two unicode strings are normalized and look the same to a human reader, if one has combining characters and the other doesn’t, they may not compare equal.
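One case where NFC alone is not enough, shown here as an illustration rather than an exhaustive list: compatibility characters such as the "fi" ligature (U+FB01) are left untouched by NFC, and only the compatibility form NFKC folds them into plain letters. The variable names below are made up for the example.

```python
import unicodedata

ligature = u'\ufb01n'  # U+FB01 LATIN SMALL LIGATURE FI followed by 'n'
plain = u'fin'

# NFC only handles canonical equivalence, so the ligature survives.
same_under_nfc = (unicodedata.normalize('NFC', ligature) ==
                  unicodedata.normalize('NFC', plain))

# NFKC also applies compatibility mappings, folding the ligature to 'fi'.
same_under_nfkc = (unicodedata.normalize('NFKC', ligature) ==
                   unicodedata.normalize('NFKC', plain))
```

Here `same_under_nfc` is False and `same_under_nfkc` is True. NFKC discards distinctions that may matter elsewhere, so whether it is appropriate depends on the data.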
Upvotes: 2