Reputation: 1411
When we invoke a system call in Linux like 'open' or a stdio function like 'fopen', we must provide a 'const char *filename'. My question is: what encoding is used here? Is it UTF-8, ASCII, or ISO 8859-x? Does it depend on the system or environment settings?
I know that MS Windows has _wopen, which accepts UTF-16.
Upvotes: 15
Views: 7597
Reputation: 215507
The filename is a byte string. Regardless of locale or any other conventions you're using about how filenames should be encoded, the string you must pass to fopen and to all other functions taking filenames/pathnames is the exact byte string by which the file is named. For example, if you have a file named ö.txt in UTF-8 in NFC, and your locale is UTF-8 encoded and uses NFC, you can just write the name as ö.txt and pass that to fopen. If your locale is Latin-1 based, though, you can't pass the Latin-1 form of ö.txt ("\xf6.txt") to fopen and expect it to succeed; that's a different byte string and thus a different filename. You would need to pass "\xc3\xb6.txt" ("Ã¶.txt" if you interpret that as Latin-1), the same byte string as the actual name.
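To make the "exact byte string" point concrete, here is a minimal sketch in C. The helper names create_empty and can_open are hypothetical, and the sketch assumes a Linux filesystem that stores names as raw bytes (ext4, for example); other filesystems may reject or normalize such names.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: create an empty file whose name is exactly the
   bytes in `name`. Returns 1 on success, 0 on failure. */
static int create_empty(const char *name)
{
    assert(name != NULL);
    FILE *f = fopen(name, "wb");
    if (f == NULL)
        return 0;
    fclose(f);
    return 1;
}

/* Hypothetical helper: returns 1 if a file with exactly these name bytes
   can be opened for reading, 0 otherwise. */
static int can_open(const char *name)
{
    assert(name != NULL);
    FILE *f = fopen(name, "rb");
    if (f == NULL)
        return 0;
    fclose(f);
    return 1;
}
```

Called from main, create_empty("\xc3\xb6.txt") creates the UTF-8 name; can_open with the same bytes should then succeed, while can_open("\xf6.txt") (the Latin-1 bytes) is expected to fail, whatever locale the process runs in.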
This situation is very different from Windows, which you seem to be familiar with, where the filename is a sequence of 16-bit units interpreted as UTF-16 (although AFAIK they need not actually be valid UTF-16), and filenames passed to fopen, etc. are interpreted according to the current locale as Unicode characters, which are then used to open/access the file based on its UTF-16 name.
Upvotes: 1
Reputation: 2077
As already mentioned above, this will be a byte string, and its interpretation is up to the underlying system. More specifically, imagine two C functions, one in user space and one in kernel space, which take char * as their parameter. The encoding in user space will depend upon the execution character set of the user program (e.g. specified by -fexec-charset=charset in gcc). The encoding expected by the kernel function depends upon the execution charset used during kernel compilation (not sure where to get that information).
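One way to check which bytes a string literal actually ends up with in a user program is to dump them. The following is a rough sketch; dump_bytes is a hypothetical helper, not part of any library.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write the bytes of `name` into `out` as
   space-separated lowercase hex. `cap` is the size of `out`.
   Returns 0 on success, -1 if `out` is too small. */
static int dump_bytes(const char *name, char *out, size_t cap)
{
    assert(name != NULL && out != NULL);
    size_t len = strlen(name);
    /* Each byte takes "xx " (3 chars); the final space is replaced by NUL. */
    size_t need = (len == 0) ? 1 : len * 3;
    if (cap < need)
        return -1;
    out[0] = '\0';
    for (size_t i = 0; i < len; i++) {
        /* Cast to unsigned char so bytes >= 0x80 are not sign-extended. */
        snprintf(out + i * 3, cap - i * 3,
                 (i + 1 < len) ? "%02x " : "%02x",
                 (unsigned char)name[i]);
    }
    return 0;
}
```

Hex escapes such as "\xc3\xb6" are emitted byte for byte, so they do not depend on the execution character set. A literal written as "ö" in the source should, by contrast, produce different dumps when the program is compiled with different -fexec-charset values.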
Upvotes: 0
Reputation: 137
I did some further inquiries on this topic and came to the conclusion that there are two different ways in which Unix-like file systems can handle filename encoding:
1. File names are encoded in the "system locale", which usually is, but need not be, the same as the current environment locale reported by the locale command (it may instead be some preset in a global configuration file).
2. File names are encoded in UTF-8, independent of any locale settings.
GTK+ solves this mess by assuming UTF-8 and allowing that to be overridden by either the current locale encoding or a user-supplied encoding.
Qt solves it by assuming the locale encoding (and that the system locale is reflected in the current locale) and allowing it to be overridden with a user-supplied conversion function.
So the bottom line is: by default, use either UTF-8 or whatever LC_ALL or LANG tell you, and provide an override setting at least for the other alternative.
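For the "what LC_ALL or LANG tell you" half, a program can ask the C library which encoding the environment's locale implies. This is a small sketch using the POSIX calls setlocale and nl_langinfo; the function names locale_codeset and is_utf8_name are made up for illustration, and the exact codeset spellings returned vary between systems.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>
#include <strings.h>

/* Hypothetical helper: return the character encoding name of the locale
   selected by the environment (LC_ALL, LC_CTYPE, LANG). */
static const char *locale_codeset(void)
{
    /* "" means "take the locale from the environment". If that fails,
       the process simply stays in the default "C" locale. */
    setlocale(LC_CTYPE, "");
    return nl_langinfo(CODESET);
}

/* Hypothetical helper: returns 1 if a codeset name denotes UTF-8.
   Only the two common spellings are checked (an assumption). */
static int is_utf8_name(const char *codeset)
{
    assert(codeset != NULL);
    return strcasecmp(codeset, "UTF-8") == 0 ||
           strcasecmp(codeset, "UTF8") == 0;
}
```

A program could call is_utf8_name(locale_codeset()) at startup and fall back to a user-supplied encoding setting when the result does not match how the filenames on disk are actually encoded.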
Upvotes: -1
Reputation: 64434
Filesystem calls on Linux are encoding-agnostic, i.e. they do not (need to) know about the particular encoding. As far as they are concerned, the byte-string pointed to by the filename argument is passed down to the filesystem as-is. The filesystem expects that filenames are in the correct encoding (usually UTF-8, as mentioned by Matthew Talbert).
This means that you often don't need to do anything (filenames are treated as opaque byte-strings), but it really depends on where you receive the filename from, and whether you need to manipulate the filename in any way.
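One case where the filename stops being opaque is when it must be shown to a user or handed to a library that expects UTF-8. Here is a sketch of a validity check that could be run before doing so; is_valid_utf8 is a hypothetical helper, and the filename can still be passed unchanged to open or fopen whatever it returns.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if the NUL-terminated byte string is
   well-formed UTF-8, 0 otherwise. */
static int is_valid_utf8(const char *s)
{
    assert(s != NULL);
    const unsigned char *p = (const unsigned char *)s;

    while (*p != 0) {
        unsigned char c = *p;
        size_t extra;      /* number of continuation bytes expected */
        unsigned long cp;  /* decoded code point */

        if (c < 0x80) { p++; continue; }               /* ASCII */
        else if (c >= 0xC2 && c <= 0xDF) { extra = 1; cp = c & 0x1F; }
        else if (c >= 0xE0 && c <= 0xEF) { extra = 2; cp = c & 0x0F; }
        else if (c >= 0xF0 && c <= 0xF4) { extra = 3; cp = c & 0x07; }
        else return 0;  /* stray continuation byte, 0xC0/0xC1, or > 0xF4 */

        p++;
        for (size_t i = 0; i < extra; i++, p++) {
            /* A NUL here fails the test too, so truncated sequences are
               rejected without reading past the end of the string. */
            if ((*p & 0xC0) != 0x80)
                return 0;
            cp = (cp << 6) | (unsigned long)(*p & 0x3F);
        }

        if (extra == 2 && cp < 0x800) return 0;            /* overlong */
        if (extra == 3 && (cp < 0x10000 || cp > 0x10FFFF)) return 0;
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;        /* surrogates */
    }
    return 1;
}
```

With this check, the UTF-8 name "\xc3\xb6.txt" is accepted and the Latin-1 name "\xf6.txt" is rejected, so a program can decide whether to display the name as-is or to escape the offending bytes.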
Upvotes: 5
Reputation: 6048
It depends on the system locale. Look at the output of the locale command. If the variables end in UTF-8, then your locale is UTF-8. Most modern Linux distributions use UTF-8. Although Andrew is correct that technically it's just a byte string, if you don't match the system locale, some programs may not work correctly and it will be impossible to get correct user input, etc. It's best to stick with UTF-8.
Upvotes: 4
Reputation: 34662
It's a byte string; the interpretation is up to the particular filesystem.
Upvotes: 9