printing the char value of each wide character's bytes

Question

When I run the following:

char acute_accent[7] = "éclair";
int i;
for (i=0; i<7; ++i)
{
    printf("acute_accent[%d]: %c\n", i, acute_accent[i]);
}

I get:

acute_accent[0]: 
acute_accent[1]: �
acute_accent[2]: c
acute_accent[3]: l
acute_accent[4]: a
acute_accent[5]: i
acute_accent[6]: r

which makes me think that the multibyte character é is two bytes wide.

However, when I run this (after ignoring the compiler's warning about a multi-character character constant):

printf("size: %zu", sizeof('é'));

I get size: 4.

What's the reason for the different sizes?

EDIT: This question differs from this one because it is more about multibyte characters encoding, the different UTFs and their sizes, than the mere understanding of a size of a char.

Zhro · Accepted Answer

You're seeing a discrepancy because, in your first example, the compiler encoded the character é (code point U+00E9) as the two-byte UTF-8 sequence 0xC3 0xA9.

See here:

http://www.fileformat.info/info/unicode/char/e9/index.htm

And as described by dbush, the character constant 'é' was stored in an integer: the compiler warning shows it was treated as a multi-character constant made of those two bytes, and such a constant has type int. It was therefore represented as four bytes.

Part of your confusion stems from relying on implementation-defined behavior: the standard does not fix how non-ASCII characters in an unprefixed literal are encoded.

To avoid depending on the implementation, you should always state the encoding of your string literals explicitly.

For example:

char acute_accent[7] = u8"éclair";

This is bad form because, unless you count the bytes yourself, you can't know the exact length of the string. Indeed, my compiler (g++) complains about it: the string's contents are 7 bytes, but it takes 8 bytes with the null character at the end, so the array has no room for the terminator.

It's much safer to use this instead:

const char* acute_accent = u8"éclair";

Notice how your string actually occupies 8 bytes:

#include <stdio.h>
#include <string.h> // strlen

int main() {
    const char* a = u8"éclair";

    // strlen counts bytes, not characters
    printf("String length : %zu\n", strlen(a));

    // Add +1 for the null byte
    printf("String size   : %zu\n", strlen(a) + 1);

    return 0;
}

The output is:

String length : 7
String size   : 8

Also note that the size of a character constant differs between C and C++ (sizeof(char) itself is 1 in both):

#include <stdio.h>

int main() {
    printf("%zu\n", sizeof('a'));

    printf("%zu\n", sizeof('é'));

    return 0;
}

In C the output is:

4
4

While in C++ the output is:

1
4
