Reputation: 75
From reading docs in either MSDN or the n1256 committee draft, I was under the impression that a char would always be exactly CHAR_BIT bits, as defined in <limits.h>. If CHAR_BIT is set to 8, then a byte is 8 bits long, and so is a char.
Given the following C code:
#include <stdio.h>
#include <limits.h>

int main(int argc, char **argv) {
    int length = 0;
    while (argv[1][length] != '\0') {
        // print the character, its hexa value, and its size
        printf("char %u: %c\tvalue: 0x%X\t sizeof char: %u\n",
               length,
               argv[1][length],
               argv[1][length],
               sizeof argv[1][length]);
        length++;
    }
    printf("\nTotal length: %u\n", length);
    printf("Actual char size: %u\n", CHAR_BIT);
    return 0;
}
I was unsure what the behaviour would be, given arguments that include non-ASCII chars, like ç and à. Those chars are supposedly UTF-8, so written as multiple bytes each. I would expect them to be processed as individual bytes, meaning ça has a length of 3 for example (4 if counting the \0), and when printing, I'd get one line per byte, so 3 lines instead of 2 (which would be the actual Latin character count).
$ gcc --std=c99 -o program.exe win32.c
$ program.exe test_çà
char 0: t value: 0x74 sizeof char: 1
char 1: e value: 0x65 sizeof char: 1
char 2: s value: 0x73 sizeof char: 1
char 3: t value: 0x74 sizeof char: 1
char 4: _ value: 0x5F sizeof char: 1
char 5: τ value: 0xFFFFFFE7 sizeof char: 1
char 6: α value: 0xFFFFFFE0 sizeof char: 1
Total length: 7
Actual char size: 8
What is probably happening under the hood is that char **argv is turned into int **argv. This would explain why lines 5 and 6 have a hexadecimal value written on 4 bytes. CHAR_BIT == 8 and sizeof(achar) == 1 and somechar == 0xFFFFFFE7. This seems counter-intuitive. What's happening?
Upvotes: 0
Views: 422
Reputation: 385789
No, it's not received as an array of int.

But it's not far from the truth: printf is indeed receiving the char as an int.

When passing an integer type smaller than an int to a vararg function like printf, it gets promoted to an int. On your system, char is a signed type.[1] Given a char with a value of -25, an int with a value of -25 was passed to printf. %X expects an unsigned int, so it's treating the int with a value of -25 as an unsigned int, printing 0xFFFFFFE7.
A simple fix:
printf("%X\n", (unsigned char)c); // 74 65 73 74 5F E7 E0
But why did you get E7 and E0 in the first place?
Each Windows system call that deals with text has two versions:

- An A) version that deals with text encoded using the system's Active Code Page.[2] For en-us installs of Windows, this is cp1252.
- A W) version that deals with text encoded using UTF-16le.

The command line is being obtained from the system using GetCommandLineA, the A version of GetCommandLine. Your system uses cp1252 as its ACP. Encoded using cp1252, ç is E7, and à is E0.

GetCommandLineW will provide the command line as UTF-16le, and CommandLineToArgvW will parse it.
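A minimal sketch of that route (untested on your setup; CommandLineToArgvW needs <shellapi.h> and linking against Shell32, and how the wide strings render still depends on the console's configuration):

#include <windows.h>
#include <shellapi.h>
#include <wchar.h>

int main(void) {
    int argc;
    /* Fetch the raw UTF-16le command line and split it with the W parser. */
    LPWSTR *argvW = CommandLineToArgvW(GetCommandLineW(), &argc);
    if (argvW == NULL)
        return 1;
    for (int i = 0; i < argc; i++)
        wprintf(L"arg %d: %ls\n", i, argvW[i]);  /* each element is a wide (UTF-16le) string */
    LocalFree(argvW);                            /* the array returned by CommandLineToArgvW must be freed with LocalFree */
    return 0;
}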
Finally, why did E7 and E0 show as τ and α?

The terminal's encoding is different than the ACP! On your machine, it appears to be 437. (This can be changed.) Encoded using cp437, τ is E7, and α is E0.

Issuing chcp 1252 will set that terminal's encoding to cp1252, matching the ACP. (UTF-8 is 65001.)

You can query the terminal's encoding using GetConsoleCP (for input) and GetConsoleOutputCP (for output). Yeah, apparently they can be different? I don't know how that would happen in practice.
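For example, a small sketch (note that chcp changes both code pages, while SetConsoleOutputCP below only changes the output one):

#include <windows.h>
#include <stdio.h>

int main(void) {
    /* Query the terminal's current input and output code pages. */
    printf("input CP: %u, output CP: %u\n", GetConsoleCP(), GetConsoleOutputCP());

    /* Switch the output code page to cp1252 so bytes E7/E0 render as ç/à. */
    if (SetConsoleOutputCP(1252))
        printf("output CP now: %u\n", GetConsoleOutputCP());
    return 0;
}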
[1] It's implementation-defined whether char is a signed or unsigned type.
Upvotes: 3
Reputation: 144740
From your code and the output on your system, it appears that:

- char has indeed 8 bits. Its size is 1 by definition.
- char **argv is a pointer to an array of pointers to C strings, null-terminated arrays of char (8-bit bytes).
- The char type is signed for your compiler configuration, hence the output 0xFFFFFFE7 and 0xFFFFFFE0 for values beyond 127. char values are passed as int to printf, which interprets the value as unsigned for the %X conversion. The behavior is technically undefined, but in practice negative values are offset by 2^32 when used as unsigned. You can configure gcc to make the char type unsigned by default with -funsigned-char, a safer choice that is also more consistent with the C library behavior.
- ç and à are encoded as single bytes E7 and E0, which correspond to Microsoft's proprietary encoding, their code page Windows-1252, not UTF-8 as you assume.

The situation is ultimately confusing: the command line argument is passed to the program encoded with the Windows-1252 code page, but the terminal uses the old MS/DOS code page 437 for compatibility with historic stuff. Hence your program outputs the bytes it receives as command line arguments, but the terminal shows the corresponding characters from CP437, namely τ and α.
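To see the raw bytes the program actually receives, independently of how the terminal renders them, a quick diagnostic along these lines can help (a sketch; dump_bytes is just an illustrative helper):

#include <stdio.h>

/* Print each byte of a string as two hex digits, casting to unsigned char to avoid sign extension. */
static void dump_bytes(const char *s) {
    for (; *s != '\0'; s++)
        printf("%02X ", (unsigned char)*s);
    printf("\n");
}

int main(int argc, char **argv) {
    if (argc > 1)
        dump_bytes(argv[1]);  /* for test_çà passed as cp1252, the last two bytes should be E7 E0 */
    return 0;
}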
Microsoft made historic decisions regarding the encoding of non-ASCII characters that seem obsolete by today's standards. It is a shame they seem stuck with cumbersome choices that other vendors have steered away from for good reasons. Programming in C in this environment is a rough road.
UTF-8 was invented in September 1992 by Unix team leaders Ken Thompson and Rob Pike. They implemented it in Plan 9 overnight, as it had a number of interesting properties for compatibility with C language character strings. Microsoft had already invested millions in their own system and ignored this simpler approach, which has become ubiquitous on the web today.
Upvotes: 2