Prabu

Reputation: 1255

How to detect unicode file names in Linux

I have a Windows application written in C++. In it, we check whether a file name is Unicode by calling wcstombs(): if the conversion fails, we assume the file name is Unicode. When I tried the same approach on Linux, the conversion never fails. I know that on Windows the default charset is Latin, whereas on Linux it is UTF-8. We run different code paths depending on whether a file name is Unicode or not, and since I can't make that distinction on Linux, I can't make my application portable for Unicode characters. Is there a workaround for this, or am I doing something wrong?

Upvotes: 2

Views: 1588

Answers (1)

Cheers and hth. - Alf

Reputation: 145269

UTF-8 has the nice property that all ASCII characters are represented exactly as in ASCII, and all non-ASCII characters are represented as sequences of two or more bytes, each >= 128. So to check for ASCII, all you have to do is test the numerical magnitude of each unsigned byte: if any byte is >= 128, the name is non-ASCII, which with UTF-8 as the base encoding means "Unicode" (even if the characters fall within the range of Latin-1; note that Latin-1 is a proper subset of Unicode, constituting the first 256 code points).


However, note that while in Windows a filename is a sequence of characters, in *nix it is a sequence of bytes.

So ideally you should ignore what those bytes might encode.

That might be difficult to reconcile with a naïve user's view, though.

Upvotes: 5

Related Questions