Reputation: 33
I have two lists of Unicode strings: one contains words picked up from a text file, the other contains sound file names from a directory, stripped of their extensions. Some of the words in one list are the same as those in the other. I tried to find the matches using re.search(ur'(?iu)\b%s\b' % string1, string2), fnmatch, and even simple string1 == string2 comparisons. All of these worked when I typed the first list myself for testing, but failed with the actual list of words read from the text file.
While testing to find out why, I examined the Vietnamese word chào, which is present in both lists. isinstance(string, unicode) confirmed that both strings were unicode. However, calling repr() on them returned u'ch\xe0o' in one case and u'cha\u0300o' in the other, so it's pretty clear why they won't match.
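For illustration, the mismatch described above can be reproduced directly with the two literals from my repr() output (Python 3 syntax; the behavior is the same in Python 2):

```python
# The same word in its two Unicode representations, as seen via repr():
composed = u'ch\xe0o'       # precomposed: LATIN SMALL LETTER A WITH GRAVE
decomposed = u'cha\u0300o'  # 'a' followed by COMBINING GRAVE ACCENT

# They render identically when printed, but are different code point sequences:
print(composed == decomposed)           # False
print(len(composed), len(decomposed))   # 4 5
```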
So I seem to have found the cause, but I'm not sure how to fix it. I tried .decode('utf-8'), thinking \xe0 might be UTF-8, but all that did was raise a Unicode encode error. Besides, if both strings are unicode and represent the same word, shouldn't they be equal? Doing print('%s Vs. %s' % (string1, string2)) prints chào Vs. chào.
I'm kind of lost here.
Many thanks in advance for your help.
Upvotes: 3
Views: 201
Reputation: 3186
The problem is an ambiguous representation of grave accents in Unicode. There is LATIN SMALL LETTER A WITH GRAVE (U+00E0), and there is COMBINING GRAVE ACCENT (U+0300), which when combined with 'a' renders as essentially the same character as the first. So there are two representations of the same character, and Unicode has a term for this: Unicode equivalence.
To handle this in Python, apply unicodedata.normalize to both strings before comparing them. I tried the 'NFC' form, which returns u'ch\xe0o' for both strings.
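A minimal sketch of the above, assuming Python 3 syntax (the unicodedata.normalize call is identical in Python 2):

```python
import unicodedata

def nfc(s):
    # NFC (canonical composition) collapses combining sequences into
    # precomposed characters where one exists.
    return unicodedata.normalize('NFC', s)

# Both representations of the word normalize to the same string:
print(nfc(u'ch\xe0o') == nfc(u'cha\u0300o'))  # True
```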
Upvotes: 4
Reputation: 308462
Some Unicode characters can be specified in different ways, as you've discovered: either as a single codepoint, or as a regular codepoint plus a combining codepoint. The character \u0300 is COMBINING GRAVE ACCENT, which adds an accent mark to the preceding character.
The process of converting a string to a common representation is called normalization. You can use the unicodedata module to do this:
import unicodedata

def n(s):
    return unicodedata.normalize('NFKC', s)

>>> n(u'ch\xe0o') == n(u'cha\u0300o')
True
Upvotes: 4