Valentin O.

Reputation: 75

What actually is the type of C `char **argv` on Windows?

From reading the docs, both MSDN and the N1256 committee draft, I was under the impression that a char is always exactly CHAR_BIT bits, as defined in <limits.h>. If CHAR_BIT is 8, then a byte is 8 bits long, and so is a char.

Test code

Given the following C code:

#include <stdio.h>
#include <limits.h>

int main(int argc, char **argv) {
    int length = 0;
    while (argv[1][length] != '\0') {
        // print the character, its hex value, and its size
        printf("char %d: %c\tvalue: 0x%X\t sizeof char: %zu\n",
                length,
                argv[1][length],
                argv[1][length],
                sizeof argv[1][length]);
        length++;
    }
    printf("\nTotal length: %d\n", length);
    printf("Actual char size: %d\n", CHAR_BIT);

    return 0;
}

I was unsure what the behaviour would be, given arguments that include non-ASCII chars, like ç and à.

Those chars are presumably UTF-8, so each is written as multiple bytes. I would expect them to be processed as individual bytes: ça, for example, would have a length of 3 (4 if counting the \0), and printing it would produce one line per byte, so 3 lines instead of 2 (the actual Latin character count).

Output

$ gcc --std=c99 -o program.exe win32.c
$ program.exe test_çà
char 0: t       value: 0x74      sizeof char: 1
char 1: e       value: 0x65      sizeof char: 1
char 2: s       value: 0x73      sizeof char: 1
char 3: t       value: 0x74      sizeof char: 1
char 4: _       value: 0x5F      sizeof char: 1
char 5: τ       value: 0xFFFFFFE7        sizeof char: 1
char 6: α       value: 0xFFFFFFE0        sizeof char: 1

Total length: 7
Actual char size: 8

Question

What is probably happening under the hood is that char **argv is turned into int **argv. That would explain why chars 5 and 6 print a hexadecimal value 4 bytes wide.

  1. Is that what actually happens?
  2. Is it standard behaviour?
  3. Why are chars 5 and 6 not what was given as input?
  4. CHAR_BIT == 8, sizeof(char) == 1, and yet a char prints as 0xFFFFFFE7. This seems counter-intuitive. What's happening?

Environment

Upvotes: 0

Views: 422

Answers (2)

ikegami

Reputation: 385789

No, it's not received as an array of int.

But it's not far from the truth: printf is indeed receiving the char as an int.

When passing an integer type smaller than an int to a vararg function like printf, it gets promoted to an int. On your system, char is a signed type.[1] Given a char with a value of -25, an int with a value of -25 was passed to printf. %u expects an unsigned int, so it treats the int with a value of -25 as an unsigned int, printing 0xFFFFFFE7.

A simple fix:

printf("%X\n", (unsigned char)c);   // 74 65 73 74 5F E7 E0

But why did you get E7 and E0 in the first place?

Each Windows system call that deals with text has two versions:

  • An "ANSI" (A) version that deals with text encoded using the system's Active Code Page.[2] For en-us installs of Windows, this is cp1252.
  • And a Wide (W) version that deals with text encoded using UTF-16le.

The command line is being obtained from the system using GetCommandLineA, the A version of GetCommandLine. Your system uses cp1252 as its ACP. Encoded using cp1252, ç is E7, and à is E0.

GetCommandLineW will provide the command line as UTF-16le, and CommandLineToArgvW will parse it.


Finally, why did E7 and E0 show as τ and α?

The terminal's encoding is different than the ACP! On your machine, it appears to be 437. (This can be changed.) Encoded using cp437, τ is E7, and α is E0.

Issuing chcp 1252 will set that terminal's encoding to cp1252, matching the ACP. (UTF-8 is 65001.)

You can query the terminal's encoding using GetConsoleCP (for input) and GetConsoleOutputCP (for output). Yeah, apparently they can be different? I don't know how that would happen in practice.


  1. It's up to the compiler whether char is a signed or unsigned type.
  2. This can be changed on a per program basis since Windows 10, Version 1903 (May 2019 Update).

Upvotes: 3

chqrlie

Reputation: 144740

From your code and the output on your system, it appears that:

  • type char is indeed 8 bits wide. Its size is 1 by definition. char **argv is a pointer to an array of pointers to C strings, i.e. null-terminated arrays of char (8-bit bytes).
  • the char type is signed for your compiler configuration, hence the output 0xFFFFFFE7 and 0xFFFFFFE0 for values beyond 127. char values are passed as int to printf, which interprets the value as unsigned for the %X conversion. The behavior is technically undefined, but in practice negative values are offset by 2^32 when used as unsigned. You can configure gcc to make the char type unsigned by default with -funsigned-char, a safer choice that is also more consistent with the C library behavior.
  • the 2 non-ASCII characters ç and à are encoded as the single bytes E7 and E0, which correspond to Microsoft's code page Windows-1252, not UTF-8 as you assumed.

The situation is ultimately confusing: the command line argument is passed to the program encoded with the Windows-1252 code page, but the terminal uses the old MS/DOS code page 437 for compatibility with historic stuff. Hence your program outputs the bytes it receives as command line arguments, but the terminal shows the corresponding characters from CP437, namely τ and α.

Microsoft made historic decisions regarding the encoding of non-ASCII characters that seem obsolete by today's standards. It is a shame they seem stuck with cumbersome choices that other vendors have steered away from for good reasons. Programming in C in this environment is a rough road.

UTF-8 was invented in September 1992 by Unix pioneers Ken Thompson and Rob Pike. They implemented it in Plan 9 overnight, as it had a number of interesting properties for compatibility with C character strings. Microsoft had already invested millions in their own system and ignored this simpler approach, which has since become ubiquitous on the web.

Upvotes: 2
