Reputation:
I wanted to print the individual bytes of the word "česnek", expecting 7 bytes because "č" is encoded in 2 bytes. I do get 7 bytes, but the first two show up as garbage characters (such as a question mark) in the terminal. If I print the integer values instead, I get this sequence:
-60 -115 101 115 110 101 107
Why are the first two numbers negative? Here is the code I used to try it.
char *utfstring = "česnek";
for(size_t i = 0; i < strlen(utfstring); i++) {
printf("%c ", utfstring[i]);
}
for(size_t i = 0; i < strlen(utfstring); i++) {
printf("%d ", utfstring[i]);
}
I expected the first two values to be c4 8d, because č is encoded that way according to https://www.utf8-chartable.de/unicode-utf8-table.pl?start=256&unicodeinhtml=dec
Upvotes: 2
Views: 7745
Reputation: 17248
First, the signedness of char is implementation-defined. On your platform it is evidently signed, so the byte 0xC4 (196) is read as 196 - 256 = -60 and 0x8D (141) as 141 - 256 = -115. On top of that, you're telling printf() to print a signed number by using %d. To portably print the bytes as unsigned numbers, cast them to unsigned char and print them using the %u format specifier:
printf("%u ", (unsigned char) utfstring[i]);
That takes care of the negative numbers, but you have another problem: the C standard does not require a compiler to accept UTF-8 encoded characters in source code. Only a small basic character set is guaranteed by the standard. Check the documentation for your specific compiler and standard library to see how this is handled. You may get UTF-8, some other encoding, or garbage; and whatever you get, it isn't portable. If this sounds lame, you're right, it is: C and C++ have been playing catch-up for a long time when it comes to i18n.
The good news is, things are getting better. If your compiler supports C11, you can and should take advantage of UTF-8 string literals to portably encode UTF-8 code points in strings.
Upvotes: 1
Reputation: 31599
Use (unsigned char)utfstring[i] or 0xFF & utfstring[i] to get the unsigned byte value, and print it in hexadecimal as follows:
char *utfstring = u8"česnek";
for(size_t i = 0; i < strlen(utfstring); i++)
printf("%02X ", 0xFF & utfstring[i]);
output:
"C4 8D 65 73 6E 65 6B"
The first character, č, cannot be represented by a single byte in UTF-8. If you print utfstring one byte at a time with a separator between the bytes, the two-byte sequence is split up and the terminal can no longer decode it. The two bytes have to be written together, as u8"č" or u8"\xC4\x8D".
In general you will need a Unicode library, such as iconv, if you wish to break the byte sequence into separate Unicode code points. If you are simply trying to find č, then use the standard string functions, for example strstr(utfstring, u8"č").
Upvotes: 5
Reputation: 8945
Your for-loop iterates through the string byte by byte, whereas the UTF-8 representation is multi-byte.
char *utfstring = "česnek";
is more than six bytes long, because the first "character" in that string occupies more than one byte. (The cleverness of the UTF-8 representation is that each byte is self-describing: by examining the bits of a single byte alone, you can reliably determine what "kind" of byte it is, that is, whether it stands alone, starts a multi-byte sequence, or continues one.)
Your logic tries to use the %c and then the %d format against these bytes when, arguably, neither one is the most appropriate. "In this [human] context, these aren't really characters, nor are they integers." Try %x (hexadecimal), casting each byte to unsigned char first so that negative values are not sign-extended. "Show me the bits."
Upvotes: 0