Reputation: 119
I have looked at other questions around Python and encoding but not quite found the solution to my problem. Here it is:
I have a small script which attempts to compare 2 lists of files:
1. A list given in a text file, which is supposed to be encoded in UTF-8 (at least Notepad++ detects it as such).
2. A list from a directory, which I build like this:
local = [f.encode('utf-8') for f in listdir(dir)]
However, for some characters I do not get the same representation: looking in a hex editor, I find that in 1 the character é is given by 65 cc, whereas in 2 it is given by c3 a9
...
What I would like is to have them in the same encoding, whatever it is.
Upvotes: 2
Views: 749
Reputation: 126937
Your first sequence is incomplete - cc is the prefix for a two-byte UTF-8 sequence. Most probably, the full sequence is 65 cc 81, which indeed is the character e (0x65) followed by a COMBINING ACUTE ACCENT (0x301, which in UTF-8 is expressed as cc 81).
The other sequence instead is the precomposed LATIN SMALL LETTER E WITH ACUTE character (0xe9, expressed as c3 a9 in UTF-8). You'll notice in the linked page that its decomposition is exactly the first sequence.
Now, in Unicode there are many instances of different sequences that graphically and/or semantically are the same, and while it's generally a good idea to treat a UTF-8 stream as an opaque binary sequence, this poses a problem if you want to do searching or indexing - looking for one sequence won't match the other, even if they are graphically and semantically the same thing. For this reason, Unicode defines four normalization forms that can be used to "flatten" these kinds of differences and obtain the same code points from both the composed and decomposed forms. For example, the NFC and NFKC normalization forms in this case will give the 0xe9 code point for both your sequences, while NFD and NFKD will give the 0x65 0x301 decomposed form.
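You can see this directly with the unicodedata module (a quick Python 3 sketch operating on the already-decoded text; the two literals are just your two sequences):

import unicodedata

decomposed = '\u0065\u0301'   # 'e' + COMBINING ACUTE ACCENT
precomposed = '\u00e9'        # LATIN SMALL LETTER E WITH ACUTE

for form in ('NFC', 'NFKC', 'NFD', 'NFKD'):
    print(form,
          [hex(ord(c)) for c in unicodedata.normalize(form, decomposed)],
          [hex(ord(c)) for c in unicodedata.normalize(form, precomposed)])
# NFC and NFKC print ['0xe9'] for both inputs;
# NFD and NFKD print ['0x65', '0x301'] for both.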
To do this in Python, you'll first have to decode your UTF-8 str objects to unicode objects, and then use the unicodedata.normalize function.
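Starting from the raw bytes you actually have, a minimal sketch (written with Python 3 bytes literals; in Python 2 the same bytes would simply be str objects, and the variable names are only illustrative):

import unicodedata

from_file = b'\x65\xcc\x81'    # the bytes you saw in the text file for é
from_listdir = b'\xc3\xa9'     # the bytes you saw from listdir for é

u1 = from_file.decode('utf-8')
u2 = from_listdir.decode('utf-8')

print(u1 == u2)                  # False: different code point sequences
print(unicodedata.normalize('NFC', u1) == unicodedata.normalize('NFC', u2))   # True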
Important note: don't normalize unless you are implementing "intelligent" indexing/searching, and use the normalized data only for this purpose - i.e. index and search normalized, but store and show to the user the original form. Normalization is a lossy operation (some forms particularly so); applying it blindly to user data is like taking a sledgehammer into a pottery shop.
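To give a sense of that lossiness, the compatibility forms fold visually distinct characters into their plain equivalents:

import unicodedata

print(unicodedata.normalize('NFKC', '\ufb01'))   # the ﬁ ligature becomes the two letters 'fi'
print(unicodedata.normalize('NFKC', '\u00b2'))   # superscript two becomes a plain '2'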
Ok, this was about Unicode in general. Talking about filesystem paths is both simpler and more complicated.
In principle, virtually all common filesystems on Windows and Linux treat paths as opaque character sequences (modulo the directory separator and possibly the NUL character), with no particular normalization form applied. So, in a given directory you can have two file names that look the same but are indeed different:
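For instance, on a filesystem that stores names as raw bytes (a typical Linux setup; HFS+ on macOS, by contrast, normalizes names itself) both of these can coexist - a small sketch, with 'demo' being just an illustrative directory name:

import os

os.makedirs('demo', exist_ok=True)
open(os.path.join('demo', 'caf\u00e9.txt'), 'w').close()    # precomposed é
open(os.path.join('demo', 'cafe\u0301.txt'), 'w').close()   # e + combining accent

print(os.listdir('demo'))   # two entries that render identically on screen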
So, when dealing with file paths, in principle you should never normalize - again, file paths are an opaque sequence of code points (actually, an opaque sequence of bytes on Linux) which should not be messed with.
However, if the list you receive and have to deal with is normalized differently (which probably means either that it has been passed through broken software that "helpfully" normalizes composed/decomposed sequences, or that the names have been typed in by hand), you'll have to perform some normalized matching.
If I were to deal with a similar (broken by definition) scenario, I'd do something like this: build a set containing the normalized content of the directory and match the normalized incoming names against it (a sketch follows below). Notice that, if multiple original names are mapped to the same normalized name and you don't match it exactly, you have no way to know which one is the "right one".
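A sketch of that idea (Python 3, where os.listdir already returns unicode strings; the helper names are mine, nothing standard), using a dict rather than a bare set so the original names can be recovered:

import os
import unicodedata

def normalized(name):
    # NFC is an arbitrary but common choice; what matters is applying
    # the same form to both sides of the comparison.
    return unicodedata.normalize('NFC', name)

def find_in_directory(wanted, directory):
    entries = os.listdir(directory)
    if wanted in entries:
        return wanted                      # exact match, nothing lost
    by_norm = {}
    for entry in entries:                  # normalized name -> original names
        by_norm.setdefault(normalized(entry), []).append(entry)
    candidates = by_norm.get(normalized(wanted), [])
    if len(candidates) == 1:
        return candidates[0]
    return None                            # no match, or ambiguous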
Upvotes: 5
Reputation: 811
At the top of your file add these two lines:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
Hope this helps!
Upvotes: -1