user1532534

Reputation: 23

Python encoding issue?

I need to test if a certain string (for example 'võiks') equals the name of any of the files contained in a directory.

>>> from os import listdir
>>> from os.path import isfile, join
>>> words = [f.replace('.html', '') for f in listdir('lemma_pages/test') if isfile(join('lemma_pages/test', f))]

>>> words
['võibolla', 'võid', 'võiks', 'võimalik', 'võin', 'võta', 'võtan', 'võtta']

>>> 'võiks' in words
False

The membership test returns False, although 'võiks' is clearly in the list. I am opening the file containing the words this way:

open('et_500.txt', 'rt', encoding="utf-8")

Any idea what I am doing wrong?

Upvotes: 0

Views: 54

Answers (1)

R Samuel Klatchko

Reputation: 76541

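The strings may not be in the same Unicode normalization form. Before comparing them, normalize both sides (the file names and the words read from the text file) with:

import unicodedata
data = unicodedata.normalize('NFC', data)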

To provide some more details, õ could be U+00F5 (LATIN SMALL LETTER O WITH TILDE) or it could be U+006F (LATIN SMALL LETTER O) followed by U+0303 (COMBINING TILDE). Both render the same on screen but are different code point sequences, so == and in treat them as different strings. Normalizing is necessary so that no matter which flavor you get, they will compare identically.
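As a rough illustration, here is a minimal sketch of the mismatch and the fix. The helper name nfc and the sample strings are made up for the example, and it assumes the directory listing yields the decomposed form while the word file yields the composed one (it could just as well be the other way round in your setup).

```python
import unicodedata


def nfc(text: str) -> str:
    """Return text in NFC (composed) form. Hypothetical helper for this example."""
    return unicodedata.normalize('NFC', text)


# Two spellings of the same visible word.
composed = 'v\u00f5iks'      # õ as the single code point U+00F5
decomposed = 'vo\u0303iks'   # o (U+006F) followed by COMBINING TILDE (U+0303)

# They look identical when printed, but they are different strings.
looks_equal_raw = composed == decomposed

# Assumption: the names from the directory listing are decomposed.
words = [decomposed, 'vo\u0303id']

# Normalize every file name once, then normalize the word being looked up.
normalized_words = {nfc(w) for w in words}
found_raw = composed in words
found_normalized = nfc(composed) in normalized_words

print(looks_equal_raw, found_raw, found_normalized)
```

Printing len() or ascii() of a file name and of the word from the text file is a quick way to check whether this is really what is happening: the decomposed form is one code point longer per accented letter.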

Upvotes: 2
