Reputation: 10460
when running the following:
char acute_accent[7] = "éclair";
int i;
for (i=0; i<7; ++i)
{
printf("acute_accent[%d]: %c\n", i, acute_accent[i]);
}
I get:
acute_accent[0]:
acute_accent[1]: �
acute_accent[2]: c
acute_accent[3]: l
acute_accent[4]: a
acute_accent[5]: i
acute_accent[6]: r
which makes me think that the multibyte character é is 2 bytes wide.
However, when running this (after ignoring the compiler warning about a multi-character character constant):
printf("size: %lu",sizeof('é'));
I get size: 4.
What's the reason for the different sizes?
EDIT: This question differs from this one because it is more about multibyte characters encoding, the different UTFs and their sizes, than the mere understanding of a size of a char.
Upvotes: 1
Views: 73
Reputation: 2614
The discrepancy arises because, in your first example, the compiler encoded the character é (code point U+00E9) as the two-byte UTF-8 sequence 0xC3 0xA9.
See here:
http://www.fileformat.info/info/unicode/char/e9/index.htm
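And as described by dbush, the character constant 'é' has type int, so sizeof reports four bytes regardless of how the character is encoded. Its value is implementation-defined (with a UTF-8 source file the compiler sees a two-byte, multi-character constant, hence the warning); it is not a UTF-32 code point.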
Part of your confusion stems from relying on an implementation-defined feature: storing non-ASCII text in a literal without specifying its encoding.
To avoid depending on the compiler's defaults, you should always clearly identify the encoding of string literals.
For example:
char acute_accent[7] = u8"éclair";
Hard-coding the array size like this is very bad form because, unless you count the bytes yourself, you can't know the exact length of the string. Indeed, my compiler (g++) rejects it because, while the string is 7 bytes, it's 8 bytes in total with the null character at the end. A C compiler accepts the initializer but leaves the array without a terminating null, so any function that expects a C string would read past the end of the buffer.
It's much safer to use this instead:
const char* acute_accent = u8"éclair";
Notice how your string is actually 8 bytes:
#include <stdio.h>
#include <string.h> // strlen
int main() {
const char* a = u8"éclair";
// strlen returns a size_t, which is printed with %zu
printf("String length : %zu\n", strlen(a));
// Add +1 for the null byte
printf("String size : %zu\n", strlen(a) + 1);
return 0;
}
The output is:
String length : 7
String size : 8
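Also note that the size of a character constant is different between C and C++: in C it always has type int, while in C++ an ordinary single-character literal has type char (a multi-character literal such as 'é' in UTF-8 is still an int).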
#include <stdio.h>
int main() {
// sizeof yields a size_t, which is printed with %zu
printf("%zu\n", sizeof('a'));
printf("%zu\n", sizeof('é'));
return 0;
}
In C the output is:
4
4
While in C++ the output is:
1
4
Upvotes: 2
Reputation: 223897
From the C99 standard, section 6.4.4.4:
2 An integer character constant is a sequence of one or more multibyte characters enclosed in single-quotes, as in 'x'.
...
10 An integer character constant has type int.
sizeof(int) on your machine is probably 4, which is why you're getting that result.
So 'é', 'c', and 'l' are all integer character constants, and all have type int, whose size is 4. The fact that some are multibyte and some are not doesn't matter in this regard.
Upvotes: 0