Reputation: 3819
I am currently learning how to work with UTF-XX encoded files and text.
I have this simple example:
std::ifstream ifs;
ifs.open("data/text.txt");
do {
    char c;
    ifs.get(c);
    printf("%x\n", c);
} while (!ifs.eof());
Where the file text.txt contains the following strings:
yabloko
яблоко
The result looks like this:
79
61
62
6c
6f
6b
6f
a
ffffffd1
ffffff8f
ffffffd0
ffffffb1
ffffffd0
ffffffbb
ffffffd0
ffffffbe
ffffffd0
ffffffba
ffffffd0
ffffffbe
I do understand why I have twice the number of lines for the Cyrillic word (it's UTF-8 encoded, and each character in this case uses 2 bytes). My question is about what get() and printf() are doing. More precisely, why is my character c printed out as an int, with the first 3 bytes set to FF? When I look at the doc for the get() method I see:
int get();
istream& get (char& c);
I tried both options. I see that the first one returns an int and the second takes a char, which confuses me. Why would these functions extract anything other than a single byte (char) at a time from a file, and why is the value for the Cyrillic characters printed out as, for example, ffffffd1 instead of d1?
Upvotes: 0
Views: 112
Reputation: 136425
More precisely, why is my character c printed out as an int?
Because char is promoted to int when passed as a variadic (...) argument of printf. On your platform char is signed, hence all codes above 127 get promoted to a negative int, which %x then prints as a 32-bit unsigned value such as ffffffd1.
You may like to use the %hhx format specifier to print a char.
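As a rough illustration of the promotion described above, here is a minimal sketch. The helper names hex_promoted and hex_byte are made up for this example, and it assumes a 32-bit int (which matches the eight hex digits in the question's output). Converting to unsigned char before the value is promoted is a common alternative to %hhx.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: mimics what happens in the question when char is
// signed. The byte is widened to a (possibly negative) int and printed
// with %x, so bytes above 0x7f come out as ffffffXX on a 32-bit int.
std::string hex_promoted(signed char c) {
    char buf[16];
    std::snprintf(buf, sizeof buf, "%x", static_cast<unsigned int>(c));
    return buf;
}

// Hypothetical helper: converts to unsigned char first, so the value
// that reaches printf is always in the range 0..255.
std::string hex_byte(char c) {
    char buf[16];
    std::snprintf(buf, sizeof buf, "%x",
                  static_cast<unsigned int>(static_cast<unsigned char>(c)));
    return buf;
}
```

Under those assumptions, hex_promoted applied to the byte 0xd1 gives ffffffd1, while hex_byte('\xd1') gives d1.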
int istream::get() returns an int rather than a char to be able to distinguish the read character from EOF. Traits::eof() is normally int(-1). A successfully read character is returned as its unsigned char value (0 to 255), so no character, Unicode or otherwise, has this code.
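To show how the int-returning overload is typically used, here is a small sketch. The function name hex_dump is hypothetical, and it takes any std::istream so it can be tried on a string stream as well as a file. Because the read result is checked before it is used, it also avoids processing a character after a failed read, which the do/while loop with eof() in the question can do.

```cpp
#include <cstdio>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: reads every byte from `in` with int get() and
// returns the bytes as space-separated hex values.
std::string hex_dump(std::istream& in) {
    std::string out;
    int ch;
    // get() yields the byte as a value in 0..255, or Traits::eof()
    // (normally -1) once no more input is available.
    while ((ch = in.get()) != std::char_traits<char>::eof()) {
        char buf[8];
        std::snprintf(buf, sizeof buf, "%x", static_cast<unsigned int>(ch));
        if (!out.empty()) out += ' ';
        out += buf;
    }
    return out;
}
```

For the file in the question, this could be called as hex_dump(ifs) on an opened std::ifstream. Since ch already holds a non-negative int, no cast or %hhx is needed to get d1 rather than ffffffd1.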
Upvotes: 3