Reputation: 75
From reading docs in either MSDN or the n1256 committee draft, I was under the impression that a char would always be exactly CHAR_BIT bits, as defined in <limits.h>. If CHAR_BIT is set to 8, then a byte is 8 bits long, and so is a char.
Given the following C code:
#include <stdio.h>
#include <limits.h>

int main(int argc, char **argv) {
    int length = 0;
    while (argv[1][length] != '\0') {
        // print the character, its hexa value, and its size
        printf("char %u: %c\tvalue: 0x%X\t sizeof char: %u\n",
               length,
               argv[1][length],
               argv[1][length],
               sizeof argv[1][length]);
        length++;
    }
    printf("\nTotal length: %u\n", length);
    printf("Actual char size: %u\n", CHAR_BIT);
    return 0;
}
I was unsure what the behaviour would be, given arguments that include non-ASCII chars, like ç and à. Those chars are supposedly UTF-8, so written as multiple bytes each. I would expect them to be processed as individual bytes, meaning ça has a length of 3 for example (4 if counting the \0), and when printing, I'd get one line per byte, so 3 lines instead of 2 (which would be the actual Latin character count).
$ gcc --std=c99 -o program.exe win32.c
$ program.exe test_çà
char 0: t value: 0x74 sizeof char: 1
char 1: e value: 0x65 sizeof char: 1
char 2: s value: 0x73 sizeof char: 1
char 3: t value: 0x74 sizeof char: 1
char 4: _ value: 0x5F sizeof char: 1
char 5: τ value: 0xFFFFFFE7 sizeof char: 1
char 6: α value: 0xFFFFFFE0 sizeof char: 1
Total length: 7
Actual char size: 8
What is probably happening under the hood is that char **argv is turned into int **argv. This would explain why lines 5 and 6 have a hexadecimal value written on 4 bytes. CHAR_BIT == 8 and sizeof(achar) == 1 and somechar == 0xFFFFFFE7. This seems counter-intuitive. What's happening?
Upvotes: 0
Views: 422
Reputation: 385789
No, it's not received as an array of int.

But it's not far from the truth: printf is indeed receiving the char as an int.

When passing an integer type smaller than an int to a vararg function like printf, it gets promoted to an int. On your system, char is a signed type.[1] Given a char with a value of -25, an int with a value of -25 was passed to printf. %X expects an unsigned int, so it's treating the int with a value of -25 as an unsigned int, printing 0xFFFFFFE7.
A simple fix:
printf("%X\n", (unsigned char)c); // 74 65 73 74 5F E7 E0
But why did you get E7 and E0 in the first place?
Each Windows system call that deals with text has two versions:

- An A) version that deals with text encoded using the system's Active Code Page.[2] For en-us installs of Windows, this is cp1252.
- A W) version that deals with text encoded using UTF-16le.

The command line is being obtained from the system using GetCommandLineA, the A version of GetCommandLine. Your system uses cp1252 as its ACP. Encoded using cp1252, ç is E7, and à is E0.

GetCommandLineW will provide the command line as UTF-16le, and CommandLineToArgvW will parse it.
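A minimal sketch of that route (untested on your setup; CommandLineToArgvW needs <shellapi.h> and linking against Shell32, and how the wide strings render still depends on the console's configuration):

#include <windows.h>
#include <shellapi.h>
#include <wchar.h>

int main(void) {
    int argc;
    /* Fetch the raw UTF-16le command line and split it with the W parser. */
    LPWSTR *argvW = CommandLineToArgvW(GetCommandLineW(), &argc);
    if (argvW == NULL)
        return 1;
    for (int i = 0; i < argc; i++)
        wprintf(L"arg %d: %ls\n", i, argvW[i]);  /* each element is a wide (UTF-16le) string */
    LocalFree(argvW);                            /* the array returned by CommandLineToArgvW must be freed with LocalFree */
    return 0;
}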
Finally, why did E7 and E0 show as τ and α?

The terminal's encoding is different than the ACP! On your machine, it appears to be 437. (This can be changed.) Encoded using cp437, τ is E7, and α is E0.

Issuing chcp 1252 will set that terminal's encoding to cp1252, matching the ACP. (UTF-8 is 65001.)

You can query the terminal's encoding using GetConsoleCP (for input) and GetConsoleOutputCP (for output). Yeah, apparently they can be different? I don't know how that would happen in practice.
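For example, a small sketch (note that chcp changes both code pages, while SetConsoleOutputCP below only changes the output one):

#include <windows.h>
#include <stdio.h>

int main(void) {
    /* Query the terminal's current input and output code pages. */
    printf("input CP: %u, output CP: %u\n", GetConsoleCP(), GetConsoleOutputCP());

    /* Switch the output code page to cp1252 so bytes E7/E0 render as ç/à. */
    if (SetConsoleOutputCP(1252))
        printf("output CP now: %u\n", GetConsoleOutputCP());
    return 0;
}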
[1] It's implementation-defined whether char is a signed or unsigned type.
Upvotes: 3
Reputation: 144740
From your code and the output on your system, it appears that:

- char has indeed 8 bits. Its size is 1 by definition.
- char **argv is a pointer to an array of pointers to C strings, null-terminated arrays of char (8-bit bytes).
- The char type is signed for your compiler configuration, hence the output 0xFFFFFFE7 and 0xFFFFFFE0 for values beyond 127. char values are passed as int to printf, which interprets the value as unsigned for the %X conversion. The behavior is technically undefined, but in practice negative values are offset by 2^32 when used as unsigned. You can configure gcc to make the char type unsigned by default with -funsigned-char, a safer choice that is also more consistent with the C library behavior.
- ç and à are encoded as single bytes E7 and E0, which correspond to Microsoft's proprietary encoding, their code page Windows-1252, not UTF-8 as you assume.

The situation is ultimately confusing: the command line argument is passed to the program encoded with the Windows-1252 code page, but the terminal uses the old MS/DOS code page 437 for compatibility with historic stuff. Hence your program outputs the bytes it receives as command line arguments, but the terminal shows the corresponding characters from CP437, namely τ and α.
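To see the raw bytes the program actually receives, independently of how the terminal renders them, a quick diagnostic along these lines can help (a sketch; dump_bytes is just an illustrative helper):

#include <stdio.h>

/* Print each byte of a string as two hex digits, casting to unsigned char to avoid sign extension. */
static void dump_bytes(const char *s) {
    for (; *s != '\0'; s++)
        printf("%02X ", (unsigned char)*s);
    printf("\n");
}

int main(int argc, char **argv) {
    if (argc > 1)
        dump_bytes(argv[1]);  /* for test_çà passed as cp1252, the last two bytes should be E7 E0 */
    return 0;
}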
Microsoft made historic decisions regarding the encoding of non-ASCII characters that seem obsolete by today's standards. It is a shame they seem stuck with cumbersome choices that other vendors have steered away from for good reasons. Programming in C in this environment is a rough road.
UTF-8 was invented in September 1992 by Unix team leaders Ken Thompson and Rob Pike. They implemented it in Plan 9 overnight, as it had a number of interesting properties for compatibility with C language character strings. Microsoft had already invested millions in their own system and ignored this simpler approach, which has become ubiquitous on the web today.
Upvotes: 2