xeranic

Reputation: 1411

What encoding is used when invoking fopen or open?

When we invoke a system call on Linux such as 'open', or a stdio function such as 'fopen', we must provide a 'const char *filename'. My question is: what encoding is used here? Is it UTF-8, ASCII, or ISO 8859-x? Does it depend on the system or environment settings?

I know that on MS Windows there is a _wopen, which accepts UTF-16.

Upvotes: 15

Views: 7597

Answers (6)

R.. GitHub STOP HELPING ICE

Reputation: 215507

The filename is a byte string. Regardless of locale or any other conventions you're using about how filenames should be encoded, the string you must pass to fopen and to all functions taking filenames/pathnames is the exact byte string the file is named with. For example, if you have a file named ö.txt in UTF-8 in NFC, and your locale is UTF-8 encoded and uses NFC, you can just write the name as ö.txt and pass that to fopen. If your locale is Latin-1 based, though, you can't pass the Latin-1 form of ö.txt ("\xf6.txt") to fopen and expect it to succeed; that's a different byte string and thus a different filename. You would need to pass "\xc3\xb6.txt" ("Ã¶.txt" if you interpret that as Latin-1), the same byte string as the actual name.
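To make the point concrete, here is a minimal sketch in C. The helper names `create_empty` and `can_open` are made up for this illustration, and it assumes a typical Linux filesystem that stores name bytes as given, plus a working directory that does not already contain a file named with the Latin-1 bytes.

```c
#include <assert.h>
#include <stdio.h>

/* The same visible name "ö.txt" written as two different byte strings. */
static const char NAME_UTF8[]   = "\xc3\xb6.txt"; /* ö encoded as UTF-8   */
static const char NAME_LATIN1[] = "\xf6.txt";     /* ö encoded as Latin-1 */

/* Hypothetical helper: create an empty file whose name is exactly
 * the given bytes. Returns 0 on success, -1 on failure. */
static int create_empty(const char *name)
{
    assert(name != NULL);
    FILE *f = fopen(name, "wb");
    if (f == NULL)
        return -1;
    return fclose(f) == 0 ? 0 : -1;
}

/* Hypothetical helper: return 1 if a file with exactly these name
 * bytes can be opened for reading, 0 otherwise. */
static int can_open(const char *name)
{
    assert(name != NULL);
    FILE *f = fopen(name, "rb");
    if (f == NULL)
        return 0;
    fclose(f);
    return 1;
}
```

After `create_empty(NAME_UTF8)`, a call to `can_open(NAME_UTF8)` should succeed while `can_open(NAME_LATIN1)` should fail, whatever the current locale is, because no conversion happens between the string you pass and the name stored on disk.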

This situation is very different from Windows, which you seem to be familiar with, where the filename is a sequence of 16-bit units interpreted as UTF-16 (although AFAIK they need not actually be valid UTF-16), and filenames passed to fopen, etc. are interpreted according to the current locale as Unicode characters, which are then used to open/access the file based on its UTF-16 name.

Upvotes: 1

tinkerbeast

Reputation: 2077

As already mentioned above, this will be a byte string, and its interpretation is left to the underlying system. More specifically, imagine two C functions, one in user space and one in kernel space, which take char * as their parameter. The encoding in user space will depend on the execution character set of the user program (e.g. specified by -fexec-charset=charset in gcc). The encoding expected by the kernel function depends on the execution charset used during kernel compilation (not sure where to get that information).

Upvotes: 0

following

Reputation: 137

I looked into this topic further and came to the conclusion that there are two different ways in which filename encoding can be handled on Unix-like file systems.

  1. File names are encoded in the "system locale", which usually is, but need not be, the same as the current environment locale reported by the locale command (it may instead be a preset in a global configuration file).

  2. File names are encoded in UTF-8, independent from any locale settings.

GTK+ solves this mess by assuming UTF-8 and allowing it to be overridden by either the current locale encoding or a user-supplied encoding.

Qt solves it by assuming the locale encoding (and that the system locale is reflected in the current locale) and allowing it to be overridden with a user-supplied conversion function.

So the bottom line is: by default, use either UTF-8 or whatever LC_ALL or LANG tells you, and provide an override setting for at least the other alternative.

Upvotes: -1

JesperE

Reputation: 64434

Filesystem calls on Linux are encoding-agnostic, i.e. they do not (need to) know about the particular encoding. As far as they are concerned, the byte-string pointed to by the filename argument is passed down to the filesystem as-is. The filesystem expects that filenames are in the correct encoding (usually UTF-8, as mentioned by Matthew Talbert).

This means that you often don't need to do anything (filenames are treated as opaque byte-strings), but it really depends on where you receive the filename from, and whether you need to manipulate the filename in any way.
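As a small illustration of manipulating a filename without knowing its encoding, the sketch below splits a path at the last '/' byte. The function name `last_component` is hypothetical. It relies only on the fact that path resolution treats the '/' byte as the separator, so the remaining bytes can be carried around untouched.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: return a pointer to the final component of
 * path, i.e. the bytes after the last '/' byte. The bytes are not
 * decoded or validated in any way; they are treated as opaque. */
static const char *last_component(const char *path)
{
    assert(path != NULL);
    const char *slash = strrchr(path, '/');
    return slash != NULL ? slash + 1 : path;
}
```

For example, `last_component("/tmp/\xc3\xb6.txt")` yields the five name bytes of the UTF-8 form followed by the terminator, and the same code works unchanged on a Latin-1 name. Anything that needs to count or display characters, rather than bytes, does need to know the encoding.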

Upvotes: 5

Matthew Talbert

Reputation: 6048

It depends on the system locale. Look at the output of the "locale" command. If the variables end in UTF-8, then your locale is UTF-8. Most modern Linux distributions use UTF-8. Although Andrew is correct that technically it's just a byte string, if you don't match the system locale some programs may not work correctly, and it will be impossible to get correct user input, etc. It's best to stick with UTF-8.
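A program can make the same check itself. The sketch below uses the POSIX calls `setlocale` and `nl_langinfo(CODESET)`; the wrapper names `environment_codeset` and `is_utf8_codeset` are made up here, and the exact codeset spellings returned vary between C libraries, so the comparison is only an assumption that covers the common cases.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/* Hypothetical helper: adopt the locale named by the environment
 * (LC_ALL, LC_CTYPE, LANG) and return its character encoding name. */
static const char *environment_codeset(void)
{
    /* If the environment names an unavailable locale, setlocale
     * returns NULL and the program stays in the "C" locale. */
    (void)setlocale(LC_ALL, "");
    const char *codeset = nl_langinfo(CODESET);
    assert(codeset != NULL);
    return codeset;
}

/* Hypothetical helper: return 1 if the codeset name denotes UTF-8.
 * Only two common spellings are recognised (an assumption). */
static int is_utf8_codeset(const char *codeset)
{
    return strcmp(codeset, "UTF-8") == 0 || strcmp(codeset, "utf8") == 0;
}
```

Note that this only reports how the current process will interpret text; it says nothing about how existing names on disk were actually encoded.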

Upvotes: 4

Andrew McGregor

Reputation: 34662

It's a byte string; the interpretation is up to the particular filesystem.

Upvotes: 9
