Reputation: 105227
To investigate how C deals with UTF-8 / Unicode characters, I did this little experiment.
It's not that I'm trying to solve anything in particular at the moment, but I know that Java deals with the whole encoding situation in a way that is transparent to the coder, and I was wondering how C, which is a lot lower level, treats its characters.
The following test seems to indicate that C is entirely ignorant of encoding concerns, and that it's just up to the display device to know how to interpret the sequence of chars when showing them on screen. The later tests (printing the characters surrounded by _) seem particularly telling:
#include <stdio.h>
#include <string.h>

int main() {
    char str[] = "João"; // ã does not belong to the standard
                         // (or extended) ASCII characters
    printf("number of chars = %d\n", (int)strlen(str)); // 5

    int len = 0;
    while (str[len] != '\0')
        len++;
    printf("number of bytes = %d\n", len); // 5

    for (int i = 0; i < len; i++)
        printf("%c", str[i]);
    puts("");
    // "João"

    for (int i = 0; i < len; i++)
        printf("_%c_", str[i]);
    puts("");
    // _J__o__�__�__o_ -> wow!!!

    str[2] = 'X'; // let's change this special character
                  // and see what happens
    for (int i = 0; i < len; i++)
        printf("%c", str[i]);
    puts("");
    // JoX�o

    for (int i = 0; i < len; i++)
        printf("_%c_", str[i]);
    puts("");
    // _J__o__X__�__o_
}
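To make the byte-level view explicit, here is a minimal follow-up sketch, assuming the source file is saved as UTF-8, that dumps each byte of the string in hex; ã shows up as the two-byte sequence C3 A3:

#include <stdio.h>

int main() {
    const char str[] = "João"; // assuming the source file and compiler use UTF-8
    const unsigned char *p = (const unsigned char *)str;
    for (size_t i = 0; str[i] != '\0'; i++)
        printf("%02X ", (unsigned)p[i]); // print each byte in hex
    puts("");
    // On a UTF-8 system: 4A 6F C3 A3 6F  (ã is the two-byte pair C3 A3)
}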
I have some knowledge of how ASCII / UTF-8 work; what I'm really unsure about is at what moment the characters get interpreted as "compound" characters, as it seems that C just treats them as dumb bytes. What's really the science behind this?
Upvotes: 1
Views: 138
Reputation: 154243
printf("_%c_", str[i]);
prints the character associated with each str[i], one at a time. The value of char str[i] is converted to an int when passed to a ... (variadic) function. The int value is then converted to unsigned char as directed by "%c", and "the resulting character is written".
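A minimal sketch of that byte-at-a-time behavior, assuming a UTF-8 terminal: each "%c" writes exactly one byte, so the two bytes of ã only render as a single character when the display decodes them together, and splitting them with _ breaks the decoding.

#include <stdio.h>

int main() {
    const char a_tilde[] = "\xC3\xA3";            // the two UTF-8 bytes of 'ã', written out explicitly
    printf("%c%c\n", a_tilde[0], a_tilde[1]);     // bytes are adjacent: the terminal shows ã
    printf("_%c__%c_\n", a_tilde[0], a_tilde[1]); // bytes are split by '_': the terminal shows two broken characters
}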
char str[] = "João";
does not necessarily specify a UTF-8 sequence; that is an implementation detail. A specified way is to use char str[] = u8"João"; available since C11.
printf()
does not specify a direct way to print UTF-8 strings.
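A minimal sketch of the u8 prefix, assuming a C11 (or later) compiler: the literal is guaranteed to be UTF-8 encoded regardless of the execution character set, but printf() still just emits the raw bytes and leaves decoding to the terminal.

#include <stdio.h>
#include <string.h>

int main() {
    const char str[] = u8"João";          // UTF-8 encoding guaranteed by the u8 prefix (C11)
    printf("bytes = %zu\n", strlen(str)); // 5: 4A 6F C3 A3 6F
    printf("%s\n", str);                  // printf just writes the bytes; the terminal decodes them
}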
Upvotes: 0
Reputation: 211720
The printing isn't a function of C, but of the display context, whatever that is. For a terminal there are UTF-8 decoding functions which map the raw character data into the character to be shown on screen using a particular font. A similar sort of display logic happens in graphical applications, though with even more complexity relating to proportional font widths, ligatures, hyphenation, and numerous other typographical concerns.
Internally this is often done by decoding UTF-8 into some intermediate form first, like UTF-16 or UTF-32, for look-up purposes. In extremely simple terms, each character in a font has a Unicode identifier. In practice this is a lot more complicated as there is room for character variants, and multiple characters may be represented by a singular character in a font, like "fi" and "ff" ligatures. Accented characters like "ç" may be a combination of characters, as allowed by Unicode. That's where things like Zalgo text come about: you can often stack a truly ridiculous number of Unicode "combining characters" together into a single output character.
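A minimal sketch of the combining-character point, assuming a UTF-8 terminal: the same visible "ç" can be stored precomposed (U+00E7) or as a plain "c" followed by a combining cedilla (U+0327), and the two forms have different byte lengths even though they render the same.

#include <stdio.h>
#include <string.h>

int main() {
    const char precomposed[] = "\xC3\xA7";  // U+00E7 LATIN SMALL LETTER C WITH CEDILLA
    const char decomposed[]  = "c\xCC\xA7"; // 'c' followed by U+0327 COMBINING CEDILLA
    printf("%s = %zu bytes\n", precomposed, strlen(precomposed)); // 2 bytes
    printf("%s = %zu bytes\n", decomposed,  strlen(decomposed));  // 3 bytes, same glyph on screen
}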
Typography is a complex world with complex libraries required to render properly.
You can handle UTF-8 data in C, but only with special libraries. Nothing that C ships with in the Standard Library can understand them; to C it's just a series of bytes, and it assumes a byte is equivalent to a character for the purposes of length. That is, strlen and such work with bytes as a unit, not characters.
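A minimal sketch of that byte/character mismatch, assuming well-formed UTF-8 input: strlen() counts bytes, while a hand-rolled loop that skips UTF-8 continuation bytes (those of the form 10xxxxxx) counts code points.

#include <stdio.h>
#include <string.h>

// Count UTF-8 code points by skipping continuation bytes (10xxxxxx).
static size_t utf8_codepoints(const char *s) {
    size_t n = 0;
    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}

int main() {
    const char str[] = "João";
    printf("strlen (bytes)      = %zu\n", strlen(str));          // 5
    printf("code points (chars) = %zu\n", utf8_codepoints(str)); // 4
}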
C++, as an example, has much better support for this distinction between byte and character. Other languages go further still; Swift, for example, has exceptional support for UTF-8 specifically and Unicode in general.
Upvotes: 1