Steve

Reputation: 755

What character encoding is used by fopen() or open()?

When you use a function like fopen(), you have to pass it a string argument for the filename. I want to know what the character encoding of this string should be.

This question has already been asked here, but it has contradictory answers. One answer says the following:

It depends on the system locale. Look at the output of the "locale" command. If the variables end in UTF-8, then your locale is UTF-8. Most modern linuxes will be using UTF-8. Although Andrew is correct that technically it's just a byte string, if you don't match the system locale some programs may not work correctly and it will be impossible to get correct user input, etc. It's best to stick with UTF-8.

While another answer says the following:

Filesystem calls on Linux are encoding-agnostic, i.e. they do not (need to) know about the particular encoding. As far as they are concerned, the byte-string pointed to by the filename argument is passed down to the filesystem as-is. The filesystem expects that filenames are in the correct encoding (usually UTF-8, as mentioned by Matthew Talbert).

This means that you often don't need to do anything (filenames are treated as opaque byte-strings), but it really depends on where you receive the filename from, and whether you need to manipulate the filename in any way.

Which answer is the correct one?

Upvotes: 2

Views: 1634

Answers (1)

Jonathan Leffler

Reputation: 754760

They're both correct in some ways.

Each string passed to a file system call is a sequence of bytes, with a null byte marking the end of the string and '/' separating path components. Within each file name segment, the meaning of the bytes is immaterial to the file system; they're just a sequence of bytes.
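To make the "just bytes" point concrete, here is a minimal sketch, assuming a POSIX system. The helper name `name_round_trips` is my own invention for illustration, not a library function. It creates a file with whatever bytes you give it, then checks that `readdir()` hands back exactly the same bytes.

```c
#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: create `name` inside `dir`, then check that
 * readdir() reports exactly the same bytes.
 * Returns 1 on a byte-for-byte match, 0 if no match, -1 on error. */
int name_round_trips(const char *dir, const char *name)
{
    char path[4096];
    int n = snprintf(path, sizeof path, "%s/%s", dir, name);
    if (n < 0 || (size_t)n >= sizeof path)
        return -1;

    /* open() receives the bytes as they are; no encoding is declared. */
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    close(fd);

    DIR *d = opendir(dir);
    if (d == NULL) {
        unlink(path);
        return -1;
    }

    int found = 0;
    struct dirent *e;
    while ((e = readdir(d)) != NULL) {
        /* strcmp() compares bytes, which is all the file system sees. */
        if (strcmp(e->d_name, name) == 0) {
            found = 1;
            break;
        }
    }
    closedir(d);
    unlink(path);   /* clean up the test file */
    return found;
}
```

On a typical Linux file system such as ext4, I'd expect both `name_round_trips("/tmp", "caf\xC3\xA9.txt")` (UTF-8) and `name_round_trips("/tmp", "caf\xE9.txt")` (ISO 8859-1, not valid UTF-8) to return 1. Some file systems do validate or normalize names, so treat that as an assumption about the platform rather than a guarantee.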

How the bytes that form the file name are displayed depends on the terminal or program used to display them, and on the encoding it assumes. If the names use UTF-8 with non-ASCII characters, printing that data as ISO 8859-15 (or 8859-1 for intransigent residents of the USA) yields gibberish, often including C1 control bytes from the byte range 0x80 .. 0x9F. If the names use 8859-15 with non-ASCII characters, there will be sequences that are not valid UTF-8, and you will get illegible or meaningless data displayed (question marks, or other indications of invalid UTF-8 sequences).
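As a rough illustration of the second case, here is a sketch of a UTF-8 well-formedness check. `is_valid_utf8` is a hypothetical helper written for this answer, not a standard function. The name "café" encoded in 8859-1 or 8859-15 ends in the single byte 0xE9, which a UTF-8 decoder reads as the start of a three-byte sequence that never arrives. In the other direction, the UTF-8 bytes 0xC3 0xA9 for "é" show up as "Ã©" on an 8859-1 display.

```c
/* Hypothetical helper: returns 1 if the NUL-terminated byte string
 * is well-formed UTF-8, else 0. */
int is_valid_utf8(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;

    while (*p) {
        unsigned char c = *p;
        int extra;              /* number of continuation bytes expected */
        unsigned long cp, min;  /* decoded code point, smallest legal value */

        if (c < 0x80) {         /* plain ASCII */
            p++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {
            extra = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {
            extra = 2; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {
            extra = 3; cp = c & 0x07; min = 0x10000;
        } else {
            return 0;           /* stray continuation or invalid lead byte */
        }

        p++;
        for (int i = 0; i < extra; i++, p++) {
            /* A NUL here fails the test too, so truncated
             * sequences are rejected without reading past the end. */
            if ((*p & 0xC0) != 0x80)
                return 0;
            cp = (cp << 6) | (*p & 0x3F);
        }

        /* Reject overlong forms, surrogates, and out-of-range values. */
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;
    }
    return 1;
}
```

With this, `is_valid_utf8("caf\xC3\xA9")` returns 1 and `is_valid_utf8("caf\xE9")` returns 0. A program that lists file names could use a check like this to decide whether to print a name as text or escape its bytes, though that policy is a design choice of the program, not something the file system requires.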

Upvotes: 4
