Ferenc Deak

Reputation: 35448

The C stdio character encoding

For my pet project I am experimenting with string representations, but I arrived at some troubling results. First, here is a short application:

#include <stdio.h>
#include <stddef.h>
#include <string.h>
void write_to_file(FILE* fp, const char* c, size_t len)
{
    /* write the length prefix, then the raw bytes of the string */
    fwrite(&len, sizeof(size_t), 1, fp);
    fwrite(c, sizeof(char), len, fp);
}
int main()
{
    FILE* fp = fopen("test.cod", "wb+");
    const char* ABCDE = "ABCDE";
    write_to_file(fp, ABCDE, strlen(ABCDE) );
    const char* nor = "BBøæåBB";
    write_to_file(fp, nor, strlen(nor));
    const char* hun = "AAőűéáöüúBB";
    write_to_file(fp, hun, strlen(hun));
    const char* per = "CCبﺙگCC";
    write_to_file(fp, per, strlen(per));
    fclose(fp);
}

It does nothing special, just takes in a string and writes its length and the string itself to a file. Now the file, when viewed as hex, looks like:

[hex dump of standard char* output]

I am happy with the first result: 5 (the first 8 bytes; I'm on a 64-bit machine), as expected. However, the nor variable in my expectation has 7 characters (since that is what I see there), but the C library thinks it has 0x0A (i.e. 10) characters (second row, starting with 0A and 8 more bytes). And some characters in the string occupy two bytes (the ø is encoded as C3 B8, and so on...).

The same is true for the hun and per variables.
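
To make the discrepancy concrete, here is a minimal sketch that counts code points instead of bytes, assuming the source file (and thus the string literals) is UTF-8 encoded, which the C3 B8 pattern above suggests; utf8_codepoints is a hypothetical helper, not a standard function:

#include <cstdio>
#include <cstring>

// Count UTF-8 code points by skipping continuation bytes (10xxxxxx).
// Assumes the input is valid UTF-8.
static size_t utf8_codepoints(const char* s)
{
    size_t n = 0;
    for (; *s; ++s)
        if (((unsigned char)*s & 0xC0) != 0x80) // not a continuation byte
            ++n;
    return n;
}

int main()
{
    const char* nor = "BBøæåBB";
    std::printf("bytes: %zu, code points: %zu\n",
                std::strlen(nor), utf8_codepoints(nor)); // 10 vs 7
}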

I did the same experiment with Unicode; the following is the application:

#include <stdio.h>
#include <stddef.h>
#include <string.h>
void write_to_file(FILE* fp, const wchar_t* c, size_t len)
{
    /* write the length prefix, then the raw wchar_t elements */
    fwrite(&len, sizeof(size_t), 1, fp);
    fwrite(c, sizeof(wchar_t), len, fp);
}

int main()
{
    FILE* fp = fopen("test.cod", "wb+");
    const wchar_t* ABCDE = L"ABCDE";
    write_to_file(fp, ABCDE, wcslen(ABCDE) );
    const wchar_t* nor = L"BBøæåBB";
    write_to_file(fp, nor, wcslen(nor));
    const wchar_t* hun = L"AAőűéáöüúBB";
    write_to_file(fp, hun, wcslen(hun));
    const wchar_t* per = L"CCبﺙگCC";
    write_to_file(fp, per, wcslen(per));
    fclose(fp);
}

The results here are the expected ones: 5 for the length of ABCDE, 7 for the length of BBøæåBB, and so on, with 4 bytes per character...

[hex dump of wchar_t* output]

So here comes the question: what encoding does the standard C library use, how trustworthy is it when developing portable applications (i.e. will what I write out on one platform be read back correctly on another?), and what other recommendations are there, considering what was presented above?

Upvotes: 5

Views: 2850

Answers (4)

wesley.mesquita

Reputation: 785

As our colleagues pointed out, fwrite does not know about the encoding.

First, take a serious look at this link; it has a great overview of encodings:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

If you don't want to use any external libs, you will have to deal with your strings in a low-level manner.

For instance, if you are sure about using wchar_t (e.g., expecting a UTF-16 encoding), one approach is to scale the len passed to write_to_file by the platform's sizeof(wchar_t), so fwrite will write the correct number of bytes.

Like this:

write_to_file(fp, ABCDE, sizeof(wchar_t)*wcslen(ABCDE) );

You have 5 wchar_t's, but on Windows/MinGW each of them is 2 bytes long.

Remember to consider the BOM (byte order mark) when dealing with UTF-16; it is what lets a reader get the bytes in the right order.
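
For illustration, a minimal sketch of emitting such a mark before the payload (write_utf16le_bom is a hypothetical helper, and little-endian order is just an assumption for the example):

#include <cstdio>

// Write a UTF-16LE byte order mark (U+FEFF) before the payload. A
// reader can inspect these two bytes to detect the byte order used.
static void write_utf16le_bom(FILE* fp)
{
    const unsigned char bom[2] = { 0xFF, 0xFE }; // U+FEFF, little-endian
    std::fwrite(bom, 1, sizeof bom, fp);
}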

Encodings like UTF-8 require a more complex approach if you want to deal with them yourself (take a look at Wikipedia), and maybe using an off-the-shelf lib is a good idea. I don't have extensive experience with UTF-8 in C++, and I'll let the colleagues indicate a good lib!

To finalize, take a look at the new string types that arrived in C++11:

u32string and u16string

Those can be helpful to guarantee the character size.

(And don't forget the old wstring, but as noted above, its wchar_t is platform dependent.)
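
A minimal sketch of those types (the sizes shown assume all the characters fit in the Basic Multilingual Plane, so UTF-16 needs no surrogate pairs):

#include <string>
#include <cstdio>

int main()
{
    // C++11 string types with guaranteed element sizes:
    std::u16string s16 = u"BBøæåBB"; // char16_t elements (UTF-16 code units)
    std::u32string s32 = U"BBøæåBB"; // char32_t elements (UTF-32 code points)
    std::printf("%zu %zu\n", s16.size(), s32.size()); // 7 7
}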

Upvotes: 0

wacky6

Reputation: 145

The standard C library does not encode anything.

If you need portability, it is better to handle the encoding explicitly; libiconv and libicu both work well. You only need to convert the data to a certain encoding, for example UTF-8, and then save the string to disk using fwrite().
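
A minimal sketch of that pipeline with iconv(3); note that "WCHAR_T" as a source encoding name is a glibc convention (other platforms may need "UTF-32LE" or similar), write_as_utf8 is a hypothetical helper, and error handling is kept to a bare minimum:

#include <iconv.h>
#include <cstdio>
#include <cwchar>

// Convert a wchar_t string to UTF-8 with iconv, then fwrite the bytes.
static int write_as_utf8(FILE* fp, const wchar_t* s)
{
    iconv_t cd = iconv_open("UTF-8", "WCHAR_T");
    if (cd == (iconv_t)-1)
        return -1;

    char*  in       = (char*)s;                  // iconv wants char**
    size_t in_left  = std::wcslen(s) * sizeof(wchar_t);
    char   out_buf[256];                         // large enough for a demo
    char*  out      = out_buf;
    size_t out_left = sizeof out_buf;

    size_t rc = iconv(cd, &in, &in_left, &out, &out_left);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return -1;

    size_t n = sizeof out_buf - out_left;        // bytes actually produced
    return std::fwrite(out_buf, 1, n, fp) == n ? 0 : -1;
}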

You should also use char, not wchar_t, because wchar_t is at least 16 bits, which may lead to endianness problems on a different platform.

As for strlen(): it is designed to be used with byte (ANSI) strings; to determine the length of a wchar_t string, use wcslen() (if available) instead. Otherwise, it is better to convert the strings explicitly.

Upvotes: 0

James Kanze

Reputation: 154037

There is no real answer to your question. Practically everything involving encoding is implementation dependent, and often locale dependent as well. Judging from appearances, your narrow character encoding is Unicode UTF-8, and your wide character encoding is Unicode UTF-32LE. This is far from universal, however; even today, I suspect that the most widespread narrow character encoding is ISO 8859-1, and there are still machines which use EBCDIC. For wide character encodings, both UTF-16 and UTF-32 are widespread, and some machines still use older encodings as well. (If you use C++ style IO, you can embed a specific encoding in the stream itself.)
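
A minimal sketch of that last point, using C++11's <codecvt> (deprecated since C++17, but it illustrates the idea of attaching an encoding to the stream):

#include <fstream>
#include <locale>
#include <codecvt>

int main()
{
    // Imbue a locale whose codecvt facet converts wchar_t to UTF-8 on
    // output, so the on-disk encoding no longer depends on sizeof(wchar_t).
    std::wofstream out("test.cod");
    out.imbue(std::locale(out.getloc(), new std::codecvt_utf8<wchar_t>));
    out << L"BBøæåBB"; // written to disk as UTF-8 bytes
}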

As for your code, fwrite doesn't know (or care) that it is dealing with characters. It just copies an image of memory out to disk (which makes it pretty useless, except for sequences of pre-formatted bytes, since such images generally can't be reliably read back in).

As for strlen: it doesn't know about multibyte characters; it returns the number of bytes until the first 0 byte, not the number of characters. The number of bytes is likely to be greater than the number of characters for any multibyte encoding format. But the issue is even more complex: independently of the encoding format, there are cases where a sequence of more than one code point results in a single character; e.g. "\u0063\u0302" represents a single character, although functions like strlen or wcslen (assuming a wide character string literal) will report more.
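
That last case is easy to demonstrate (the output shown assumes a platform where this literal becomes two wchar_t code units, as on typical Linux and Windows systems):

#include <cstdio>
#include <cwchar>

int main()
{
    // L"\u0063\u0302" is 'c' followed by a combining circumflex accent:
    // one displayed character, but two code points.
    const wchar_t* s = L"\u0063\u0302";
    std::printf("wcslen: %zu\n", std::wcslen(s)); // prints 2, not 1
}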

Upvotes: 3

user1781290

Reputation: 2874

As far as I know, the standard C library does no encoding at all. I suppose your source file in the first case uses UTF-8 as its encoding, so your string constants end up as UTF-8 string constants in the compiled code. That is why you get a string with a length of 10 chars.

fwrite takes an (untyped) byte array as its argument. Since it does not know anything about the bytes it processes, it cannot do any encoding conversion here.

Regarding portability, you should be more careful about things like the width of size_t: fwrite(&len, sizeof(size_t), 1, fp) can yield different results on different platforms, possibly causing your file to be read back incorrectly. Also (especially with multi-byte encodings) you have to be careful with the platform's endianness.
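
One way around both problems is sketched below: a fixed-width length prefix written byte by byte in a fixed order (write_len_le64 is a hypothetical helper; little-endian is an arbitrary choice, as long as reader and writer agree):

#include <cstdio>
#include <cstdint>

// Write a 64-bit length prefix in little-endian order, so the file
// layout no longer depends on sizeof(size_t) or host endianness.
static void write_len_le64(FILE* fp, std::uint64_t len)
{
    unsigned char buf[8];
    for (int i = 0; i < 8; ++i)
        buf[i] = (unsigned char)(len >> (8 * i)); // least significant first
    std::fwrite(buf, 1, sizeof buf, fp);
}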

For anything else, you can be sure that your standard library will put the bytes on disk exactly as you pass them; but when processing them as text, you have to make sure you use the same encoding on all platforms.

Upvotes: 5
