Reputation: 33
I have two lists of Unicode strings: one contains words picked up from a text file, the other contains sound file names from a directory, stripped of their extensions. Some of the words in one list are the same as those in the other. I tried to find the matches using re.search(ur'(?iu)\b%s\b' % string1, string2), fnmatch, and even simple string1 == string2 comparisons. All of these worked when I typed the first list myself for testing, but failed with the actual list of words read from the text file.
While testing to find out why, I examined the Vietnamese word chào, which is present in both lists. isinstance(string, unicode) confirmed that both strings were unicode. However, calling repr() on them returned u'ch\xe0o' in one case and u'cha\u0300o' in the other, so it's pretty clear why they won't match.
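For illustration, the mismatch described above can be reproduced directly with the two literals from my repr() output (Python 3 syntax; the behavior is the same in Python 2):

```python
# The same word in its two Unicode representations, as seen via repr():
composed = u'ch\xe0o'       # precomposed: LATIN SMALL LETTER A WITH GRAVE
decomposed = u'cha\u0300o'  # 'a' followed by COMBINING GRAVE ACCENT

# They render identically when printed, but are different code point sequences:
print(composed == decomposed)           # False
print(len(composed), len(decomposed))   # 4 5
```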
So I seem to have found the cause, but I'm not sure how to fix it. I tried .decode('utf-8'), thinking \xe0 might be UTF-8, but all that did was raise a Unicode encode error. Besides, if both strings are unicode and represent the same word, shouldn't they be equal? Doing print('%s Vs. %s' % (string1, string2)) prints chào Vs. chào.
I'm kind of lost here.
Many thanks in advance for your help.
Upvotes: 3
Views: 201
Reputation: 3186
The problem is an ambiguous representation of grave accents in Unicode. There is LATIN SMALL LETTER A WITH GRAVE (U+00E0), and there is COMBINING GRAVE ACCENT (U+0300), which when combined with 'a' renders as essentially the same character as the first. So there are two representations of the same character, and Unicode has a term for this: Unicode equivalence.
To handle this in Python, apply unicodedata.normalize to both strings before comparing them. I tried the 'NFC' form, which returns u'ch\xe0o' for both strings.
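A minimal sketch of the above, assuming Python 3 syntax (the unicodedata.normalize call is identical in Python 2):

```python
import unicodedata

def nfc(s):
    # NFC (canonical composition) collapses combining sequences into
    # precomposed characters where one exists.
    return unicodedata.normalize('NFC', s)

# Both representations of the word normalize to the same string:
print(nfc(u'ch\xe0o') == nfc(u'cha\u0300o'))  # True
```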
Upvotes: 4
Reputation: 308462
Some Unicode characters can be specified in different ways, as you've discovered: either as a single codepoint, or as a regular codepoint plus a combining codepoint. The character \u0300 is COMBINING GRAVE ACCENT, which adds an accent mark to the preceding character.
The process of converting a string to a common representation is called normalization. You can use the unicodedata module to do this:
import unicodedata

def n(s):
    return unicodedata.normalize('NFKC', s)

>>> n(u'ch\xe0o') == n(u'cha\u0300o')
True
Upvotes: 4