Kevin

Reputation: 4737

Conflict between UTF-8 normalized-forms of encoding for accents

I've got a bug with UTF-8 normalizations:

As far as I understand, there are (at least) two ways to write an 'é' in UTF-8: 65 CC 81 (an 'e' followed by the combining acute accent U+0301) and C3 A9 (the precomposed character U+00E9).

[After a migration from Mac/OS X to a PC/Linux] I now have a conflict between the paths stored in my database and the actual file system structure, which prevents me from accessing my files correctly...

With the help of java.text.Normalizer, I worked out that in the FS I've got:

NFD true
NFC false
NFKD true
NFKC false

while in the database (and from the keyboard), I have:

NFD false
NFC true
NFKD false
NFKC true
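For reference, the two tables above can be reproduced with something like the following. This is only a minimal sketch: the class name NormalizationCheck and the helper utf8Hex are made up for illustration, and the two sample strings are the decomposed and precomposed spellings of 'é', not my real directory names.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

public class NormalizationCheck {

    // Hypothetical helper: hex dump of the UTF-8 bytes of a string.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String decomposed = "e\u0301";   // 'e' + combining acute accent (as found in the FS)
        String precomposed = "\u00E9";   // single code point (as found in the database)

        for (String s : new String[] {decomposed, precomposed}) {
            System.out.println("bytes: " + utf8Hex(s));
            // Report, for each of the four forms, whether the string is already in that form.
            for (Normalizer.Form form : Normalizer.Form.values()) {
                System.out.println(form + " " + Normalizer.isNormalized(s, form));
            }
        }
    }
}
```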

Which of these four normalization forms should I comply with? How can I (automatically) fix the encoding of the file system directory names?


EDIT2: the problem is not at all what I thought it was at the beginning, hence everything below is struck out.

Do you know if there is any rule (RFC?) defining the handling of file:// URLs?

My concern is about the accents. I'm trying to access a picture at

file:///other/Web/data/images/2005/2005-12-31 Fin d'année/IMGP0012.JPG

but it doesn't work. EDIT: of course it doesn't work with &eacute; in the URL...

However, Gumbo's suggestion

file:///other/Web/data/images/2005/2005-12-31%20Fin%20d'ann%C3%A9e/IMGP0012.JPG

doesn't work either, but (Firefox->Copy Link Location)

file:///other/Web/data/images/2005/2005-12-31%20Fin%20d%27anne%CC%81e

is okay.

Is there any standard way to access this data on the local filesystem, or shall I try all the available encodings?

(my code is written in Java and I test it with FF 3.6)

Upvotes: 1

Views: 1071

Answers (2)

Kevin

Reputation: 4737

I finally 'normalized' (renamed) my file system directories to match the names stored in the database. OS X messed everything up!
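A rough sketch of how such a rename could be automated, assuming the database names are in NFC (as the check in the question suggests). The class NfcRenamer and its methods are hypothetical names, and this is not necessarily how the directories were actually renamed; try it on a copy of the data first.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.text.Normalizer;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class NfcRenamer {

    // Convert a single file or directory name to NFC (precomposed form).
    static String toNfc(String name) {
        return Normalizer.normalize(name, Normalizer.Form.NFC);
    }

    // Rename every entry below root whose name is not already in NFC.
    // Returns the number of entries renamed.
    static int renameTree(Path root) throws IOException {
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(root)) {
            // Deepest entries first, so renaming a directory never
            // invalidates the paths of children still waiting to be processed.
            paths = walk.filter(p -> !p.equals(root))
                        .sorted(Comparator.comparingInt(Path::getNameCount).reversed())
                        .collect(Collectors.toList());
        }
        int renamed = 0;
        for (Path p : paths) {
            String name = p.getFileName().toString();
            String nfc = toNfc(name);
            if (nfc.equals(name)) {
                continue; // already in NFC
            }
            Path target = p.resolveSibling(nfc);
            if (Files.exists(target)) {
                // Do not overwrite: both spellings exist side by side.
                System.err.println("Skipping, target exists: " + target);
                continue;
            }
            Files.move(p, target);
            renamed++;
        }
        return renamed;
    }
}
```

Calling NfcRenamer.renameTree(Paths.get("/other/Web/data/images")) would then process the picture tree from the question.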

Upvotes: 1

Gumbo

Reputation: 655499

You need to encode these characters using percent-encoding. Try this:

file:///other/Web/data/images/2005/2005-12-31%20Fin%20d'ann%C3%A9e/IMGP0012.JPG

Here %C3%A9 represents the é encoded in UTF-8. You may need to use a different character encoding if your application expects something other than UTF-8.
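Since the question mentions Java, here is a minimal sketch of doing the percent-encoding by hand. The class FileUrlEncoder and the method encodePath are hypothetical names, and the sketch assumes an absolute Unix-style path. It encodes the UTF-8 bytes exactly as they are, without normalizing, so the result depends on whether the string holds the precomposed or the decomposed é.

```java
import java.nio.charset.StandardCharsets;

public class FileUrlEncoder {

    // Build a file:// URL from an absolute path, percent-encoding every
    // UTF-8 byte that is not an unreserved character or a '/' separator.
    static String encodePath(String absolutePath) {
        StringBuilder sb = new StringBuilder("file://");
        for (byte b : absolutePath.getBytes(StandardCharsets.UTF_8)) {
            int c = b & 0xFF;
            boolean keep = (c >= 'a' && c <= 'z')
                    || (c >= 'A' && c <= 'Z')
                    || (c >= '0' && c <= '9')
                    || "-._~/".indexOf(c) >= 0;
            if (keep) {
                sb.append((char) c);
            } else {
                sb.append(String.format("%%%02X", c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Precomposed é (U+00E9) versus 'e' + combining accent (U+0301).
        System.out.println(encodePath("/other/Web/data/images/2005/2005-12-31 Fin d'ann\u00E9e"));
        System.out.println(encodePath("/other/Web/data/images/2005/2005-12-31 Fin d'anne\u0301e"));
    }
}
```

The first call yields ...ann%C3%A9e and the second ...anne%CC%81e, so the URL only resolves if the string uses the same form as the name stored on disk.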

Upvotes: 4
