Reputation: 119
I have looked at other questions around Python and encoding but not quite found the solution to my problem. Here it is:
I have a small script which attempts to compare 2 lists of files:
1. A list given in a text file, which is supposed to be encoded in UTF-8 (at least Notepad++ detects it as such).
2. A list from a directory, which I build like this:
local = [f.encode('utf-8') for f in listdir(dir)]
However, for some characters I do not get the same representation: looking in a hex editor, I find that in 1 the character é is given by 65 cc, whereas in 2 it is given by c3 a9
...
What I would like is to have them in the same encoding, whatever it is.
Upvotes: 2
Views: 749
Reputation: 126937
Your first sequence is incomplete - cc is the prefix for a two-byte UTF-8 sequence. Most probably, the full sequence is 65 cc 81, which indeed is the character e (0x65) followed by a COMBINING ACUTE ACCENT (0x301, which in UTF-8 is expressed as cc 81).
The other sequence instead is the precomposed LATIN SMALL LETTER E WITH ACUTE character (0xe9, expressed as c3 a9 in UTF-8). You'll notice in the linked page that its decomposition is exactly the first sequence.
Now, in Unicode there are many instances of different sequences that graphically and/or semantically are the same, and while it's generally a good idea to treat a UTF-8 stream as an opaque binary sequence, this poses a problem if you want to do searching or indexing - looking for one sequence won't match the other, even if they are graphically and semantically the same thing. For this reason, Unicode defines four normalization forms that can be used to "flatten" these kinds of differences and obtain the same code points from both the composed and decomposed forms. For example, the NFC and NFKC normalization forms in this case will give the 0xe9 code point for both your sequences, while NFD and NFKD will give the 0x65 0x301 decomposed form.
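You can see this directly with the unicodedata module (a quick Python 3 sketch operating on the already-decoded text; the two literals are just your two sequences):

import unicodedata

decomposed = '\u0065\u0301'   # 'e' + COMBINING ACUTE ACCENT
precomposed = '\u00e9'        # LATIN SMALL LETTER E WITH ACUTE

for form in ('NFC', 'NFKC', 'NFD', 'NFKD'):
    print(form,
          [hex(ord(c)) for c in unicodedata.normalize(form, decomposed)],
          [hex(ord(c)) for c in unicodedata.normalize(form, precomposed)])
# NFC and NFKC print ['0xe9'] for both inputs;
# NFD and NFKD print ['0x65', '0x301'] for both.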
To do this in Python, you'll first have to decode your UTF-8 str objects to unicode objects, and then use the unicodedata.normalize function.
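Starting from the raw bytes you actually have, a minimal sketch (written with Python 3 bytes literals; in Python 2 the same bytes would simply be str objects, and the variable names are only illustrative):

import unicodedata

from_file = b'\x65\xcc\x81'    # the bytes you saw in the text file for é
from_listdir = b'\xc3\xa9'     # the bytes you saw from listdir for é

u1 = from_file.decode('utf-8')
u2 = from_listdir.decode('utf-8')

print(u1 == u2)                  # False: different code point sequences
print(unicodedata.normalize('NFC', u1) == unicodedata.normalize('NFC', u2))   # True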
Important note: don't normalize unless you are implementing "intelligent" indexing/searching, and use the normalized data only for this purpose - i.e. index and search normalized, but store and show to the user the original form. Normalization is a lossy operation (some forms particularly so); applying it blindly to user data is like taking a sledgehammer into a pottery shop.
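To give a sense of that lossiness, the compatibility forms fold visually distinct characters into their plain equivalents:

import unicodedata

print(unicodedata.normalize('NFKC', '\ufb01'))   # the ﬁ ligature becomes the two letters 'fi'
print(unicodedata.normalize('NFKC', '\u00b2'))   # superscript two becomes a plain '2'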
Ok, this was about Unicode in general. Talking about filesystem paths is both simpler and more complicated.
In principle, virtually all common filesystems on Windows and Linux treat paths as opaque character sequences (modulo the directory separator and possibly the NUL character), with no particular normalization form applied. So, in a given directory you can have two file names that look the same but are indeed different:
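For instance, on a filesystem that stores names as raw bytes (a typical Linux setup; HFS+ on macOS, by contrast, normalizes names itself) both of these can coexist - a small sketch, with 'demo' being just an illustrative directory name:

import os

os.makedirs('demo', exist_ok=True)
open(os.path.join('demo', 'caf\u00e9.txt'), 'w').close()    # precomposed é
open(os.path.join('demo', 'cafe\u0301.txt'), 'w').close()   # e + combining accent

print(os.listdir('demo'))   # two entries that render identically on screen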
So, when dealing with file paths, in principle you should never normalize - again, file paths are an opaque sequence of code points (actually, an opaque sequence of bytes on Linux) which should not be messed with.
However, if the list you receive and have to deal with is normalized differently (which probably means either that it has been passed through broken software that "helpfully" normalizes composed/decomposed sequences, or that the names have been typed in by hand), you'll have to perform some normalized matching.
If I were to deal with a similar (broken by definition) scenario, I'd do something like this: build a set containing the normalized content of the directory and match the normalized incoming names against it (a sketch follows below). Notice that, if multiple original names are mapped to the same normalized name and you don't match it exactly, you have no way to know which one is the "right one".
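A sketch of that idea (Python 3, where os.listdir already returns unicode strings; the helper names are mine, nothing standard), using a dict rather than a bare set so the original names can be recovered:

import os
import unicodedata

def normalized(name):
    # NFC is an arbitrary but common choice; what matters is applying
    # the same form to both sides of the comparison.
    return unicodedata.normalize('NFC', name)

def find_in_directory(wanted, directory):
    entries = os.listdir(directory)
    if wanted in entries:
        return wanted                      # exact match, nothing lost
    by_norm = {}
    for entry in entries:                  # normalized name -> original names
        by_norm.setdefault(normalized(entry), []).append(entry)
    candidates = by_norm.get(normalized(wanted), [])
    if len(candidates) == 1:
        return candidates[0]
    return None                            # no match, or ambiguous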
Upvotes: 5
Reputation: 811
At the top of your file add these two lines:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
Hope this helps!
Upvotes: -1