EsoEsMiNombre

Reputation: 33

Two seemingly identical unicode strings turn out to be different when using repr(), but how can I fix this?

I have two lists of unicode strings: one contains words picked up from a text file, the other contains sound file names from a directory, stripped of their extensions. Some of the words in one list are the same as those in the other. I tried to find the matches using re.search(ur'(?iu)\b%s\b' % string1, string2), fnmatch, and even simple string1 == string2 comparisons, all of which worked when I typed the first list myself for testing, but failed with the actual list of words retrieved from the text file.

While testing to find out why, I monitored the Vietnamese word chào, which is present in both lists. isinstance(string, unicode) confirmed that both strings were unicode. However, repr() returned u'ch\xe0o' for one and u'cha\u0300o' for the other, so it's pretty clear why they won't match.

So I seem to have found the cause, but I'm not sure how to fix it. I tried .decode('utf-8'), thinking \xe0 might be UTF-8, but all that did was raise a UnicodeEncodeError. Besides, if both strings are unicode and represent the same word, shouldn't they be equal? print('%s Vs. %s' % (string1, string2)) prints chào Vs. chào. I'm kind of lost here.
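For anyone wanting to reproduce the mismatch: the two repr() forms above can be typed in directly, and they compare unequal even though both render as chào (a minimal sketch; the variable names are illustrative, not from the original lists):

```python
# Two representations of the Vietnamese word "chào":
precomposed = u'ch\xe0o'     # uses U+00E0, a single accented codepoint
decomposed = u'cha\u0300o'   # uses 'a' followed by U+0300, a combining accent

# They print identically but are different codepoint sequences:
print(precomposed)                # chào
print(decomposed)                 # chào
print(precomposed == decomposed)  # False
print(len(precomposed), len(decomposed))  # 4 vs. 5 codepoints
```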

Many thanks in advance for your help.

Upvotes: 3

Views: 201

Answers (2)

Andrew Johnson

Reputation: 3186

The problem is an ambiguity in how Unicode represents grave accents. U+00E0 is LATIN SMALL LETTER A WITH GRAVE, and U+0300 is COMBINING GRAVE ACCENT, which, when combined with 'a', renders as more or less the exact same character as the first. So there are two representations of the same character. In fact, Unicode has a term for this: Unicode equivalence.

To handle this in Python, apply unicodedata.normalize to both strings before comparing. I tried 'NFC' mode, which returns u'ch\xe0o' for both strings.
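A quick sketch of that check, using the two forms from the question:

```python
import unicodedata

s1 = u'ch\xe0o'      # precomposed form
s2 = u'cha\u0300o'   # decomposed form

# NFC composes 'a' + COMBINING GRAVE ACCENT into U+00E0,
# so both strings end up as the same codepoint sequence:
n1 = unicodedata.normalize('NFC', s1)
n2 = unicodedata.normalize('NFC', s2)

print(n1 == n2)  # True
```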

Upvotes: 4

Mark Ransom

Reputation: 308462

Some Unicode characters can be specified different ways, as you've discovered, either as a single codepoint or as a regular codepoint plus a combining codepoint. The character \u0300 is a COMBINING GRAVE ACCENT, which adds an accent mark to the preceding character.

The process of converting a string to a common representation is called normalization. You can use the unicodedata module to do this:

import unicodedata

def n(s):
    return unicodedata.normalize('NFKC', s)

>>> n(u'ch\xe0o') == n(u'cha\u0300o')
True
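Applied to the matching problem in the question, normalizing both lists before comparing finds the matches that plain equality missed (a sketch; the list names and contents here are hypothetical stand-ins for the word list and the stripped file names):

```python
import unicodedata

def n(s):
    return unicodedata.normalize('NFKC', s)

words = [u'ch\xe0o', u'xin']        # hypothetical words from the text file
sound_names = [u'cha\u0300o']       # hypothetical file names, extensions stripped

# Compare normalized forms so both representations of chào match:
matches = [w for w in words if any(n(w) == n(f) for f in sound_names)]
print(matches)
```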

Upvotes: 4
