Explanations about the Ifstream get() method behaviour when reading UTF-8 encoded text (C++)

Question

I am currently learning how to work with UTF-XX encoded files and text.

I have this simple example:

std::ifstream ifs;
ifs.open("data/text.txt");
do {
    char c;
    ifs.get(c);
    printf("%x\n", c);
} while (!ifs.eof());

Where the file text.txt contains the following strings:

yabloko
яблоко

The result looks like this:

79
61
62
6c
6f
6b
6f
a
ffffffd1
ffffff8f
ffffffd0
ffffffb1
ffffffd0
ffffffbb
ffffffd0
ffffffbe
ffffffd0
ffffffba
ffffffd0
ffffffbe

I do understand why I get twice as many lines for the Cyrillic word (the file is UTF-8 encoded, and each of these characters takes 2 bytes). My question is about what get() and printf() are doing. More precisely, why is my character c printed out as an int, with the first 3 bytes set to ff? When I look at the documentation for the get() method I see:

int get();
istream& get (char& c);

I tried both options. The first one returns an int and the second takes a char, which confuses me. Why would these functions extract anything other than a single byte (char) at a time from the file, and why are the values for the Cyrillic characters printed as, for example, ffffffd1 instead of d1?

Maxim Egorushkin · Accepted Answer

More precisely, why is my character c printed out as an int?

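Because a char is promoted to int when it is passed as a variadic (...) argument to printf. On your platform char is signed, so every byte value above 127 is stored as a negative char and gets promoted to a negative int, which %x then prints with the sign bits set (ffffffd1 rather than d1).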

You may like to use %hhx format specifier to print char.
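
As a rough sketch of the difference, the two helpers below format the same byte with and without the sign extension. The names hex_of_byte and hex_after_promotion are made up for this illustration, and the second one assumes a 32-bit int with two's complement representation, as on the asker's platform:

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: formats one byte as two hex digits.
// Casting to unsigned char first means the promotion to int
// yields a value in 0..255, so no ff padding appears.
std::string hex_of_byte(char c) {
    char buf[3];
    std::snprintf(buf, sizeof buf, "%02x",
                  static_cast<unsigned int>(static_cast<unsigned char>(c)));
    return std::string(buf);
}

// Hypothetical helper: reproduces what the question's printf("%x", c) does
// when char is signed. The signed char is widened to int (sign-extended),
// and the resulting bit pattern is printed as an unsigned int.
std::string hex_after_promotion(signed char c) {
    char buf[2 * sizeof(unsigned int) + 1];
    int promoted = c;  // the same promotion printf's ... applies
    std::snprintf(buf, sizeof buf, "%x", static_cast<unsigned int>(promoted));
    return std::string(buf);
}
```

With these, hex_of_byte('\xd1') gives "d1", while hex_after_promotion gives the sign-extended form seen in the question for the same byte.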

int istream::get() returns an int rather than a char so that it can distinguish a character that was read from EOF. A successfully read byte is returned as a value in the range 0 to 255, whereas Traits::eof() is normally int(-1). No Unicode character has this code.
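
A minimal sketch of a read loop built on the int-returning overload, assuming any std::istream as input (the function name read_bytes is invented for this example). Because the result is compared against eof() before it is used, nothing is processed after a failed read:

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: collects every byte of a stream as an int in 0..255.
std::vector<int> read_bytes(std::istream& in) {
    std::vector<int> bytes;
    int ch;
    // get() returns either a byte value (0..255) or Traits::eof().
    while ((ch = in.get()) != std::char_traits<char>::eof()) {
        bytes.push_back(ch);
    }
    return bytes;
}
```

Each stored value can be printed directly with %x (or %02x for fixed width), since it is already a non-negative int.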
