Rick

Reputation: 7506

Does the source file encoding or the execution charset change how a wchar_t is stored internally?

I've read the documentation (https://learn.microsoft.com/en-us/cpp/build/reference/source-charset-set-source-character-set) on VC++ /source-charset and /execution-charset.

So there are 3 things I need to keep consistent (if anything is wrong, please correct me):

  1. source file encoding
  2. the /source-charset setting (determines how the compiler interprets the bytes of my source file)
  3. the /execution-charset setting (determines how the compiler encodes the text it decoded in step 2 into the executable); a concrete sketch follows this list
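
For concreteness, here is a minimal sketch of what I mean by keeping the three in sync (the file name and the choice of UTF-8 as encodingA are just placeholders):

// main.cpp, saved on disk as UTF-8 ("encodingA" = UTF-8 in this sketch),
// then built with:
//
//   cl /source-charset:utf-8 /execution-charset:utf-8 main.cpp
//
// so that (1) the bytes on disk, (2) how the compiler decodes them, and
// (3) how literals are encoded into the executable all agree.
int main() {
    wchar_t  wc  = L'é';
    char16_t c16 = u'é';
    char32_t c32 = U'é';
    (void)wc; (void)c16; (void)c32;  // just to silence unused-variable warnings
}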

So, if I save the source file with encodingA, set both /source-charset and /execution-charset to encodingA, and have code like wchar_t c = L'é'; or char16_t c = u'é'; or char32_t c = U'é',

will the code unit(s) stored for é change depending on which encodingA I choose during that "interpreting"?

Or will é's code unit(s) stay the same no matter which encoding I choose?

(Don't worry about console output.)

Upvotes: 2

Views: 588

Answers (2)

Remy Lebeau

Reputation: 596417

The encoding you save your source file with dictates how Unicode is stored as bytes on disk, nothing more. The code editor knows é is Unicode codepoint U+00E9 and will encode it to the file accordingly (0xE9 in Latin-1, 0xC3 0xA9 in UTF-8, etc). /source-charset simply tells the compiler which encoding to expect when it reads those bytes back.

When the compiler then reads the source file, it converts the file's bytes to Unicode using the specified /source-charset, and then processes that Unicode data as needed. At this stage, provided the correct /source-charset is used so the file's bytes are decoded properly, the é is read back in as Unicode codepoint U+00E9, and is not tied to any particular encoding until the next step.

The /execution-charset dictates what encoding Unicode data is saved as in the executable if no other encoding is specified in the code. It does not apply in your examples, because the L/u/U prefixes dictate the encoding (L = UTF-16 or UTF-32, depending on the platform; u = UTF-16; U = UTF-32). So:

wchar_t wc = L'é'; // 0xE9 0x00 or 0xE9 0x00 0x00 0x00

char16_t c16 = u'é'; // 0xE9 0x00

char32_t c32 = U'é'; // 0xE9 0x00 0x00 0x00
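
To see this for yourself, here is a small test program (my own sketch, not part of the original answer): compiled with a matching /source-charset, each literal prints the codepoint value 0xE9, regardless of which /execution-charset you pick.

#include <cstdio>

int main() {
    wchar_t  wc  = L'é';
    char16_t c16 = u'é';
    char32_t c32 = U'é';
    // Each value is codepoint U+00E9 in the prefix-selected encoding.
    std::printf("wchar_t:  0x%04X (%zu bytes)\n", (unsigned)wc,  sizeof wc);
    std::printf("char16_t: 0x%04X (%zu bytes)\n", (unsigned)c16, sizeof c16);
    std::printf("char32_t: 0x%08X (%zu bytes)\n", (unsigned)c32, sizeof c32);
}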

If you were using char instead, then /execution-charset would apply:

char c = 'é';  // MAYBE 0xE9 or other single-byte value, or a multi-byte overflow warning/error

const char *s = "é";  // MAYBE 0xE9 or other single-byte value, or maybe 0xC3 0xA9

Unless you use the u8 prefix for UTF-8:

char c = u8'é'; // illegal! é needs two UTF-8 code units, so it cannot fit in a single u8 character literal

const char *s8 = u8"é";  // 0xC3 0xA9 (note: the type is const char8_t* in C++20)
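
If you want to check what /execution-charset actually produced, a small byte dump like this works (my own sketch; it assumes C++17, where u8"é" is still const char*, while in C++20 it becomes const char8_t* and would need a cast):

#include <cstdio>
#include <cstring>

static void dump(const char *label, const char *s) {
    std::printf("%s:", label);
    for (std::size_t i = 0; i < std::strlen(s); ++i)
        std::printf(" 0x%02X", (unsigned)(unsigned char)s[i]);
    std::printf("\n");
}

int main() {
    const char *s  = "é";   // bytes chosen by /execution-charset
    const char *s8 = u8"é"; // always UTF-8: 0xC3 0xA9
    dump("narrow literal", s);
    dump("u8 literal", s8);
}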

Upvotes: 6

rustyx

Reputation: 85371

When you write wchar_t c = L'é'; in the source file, that text has to be stored as raw bytes somehow, and the encoding you use when saving the source file determines which bytes represent the é.

Obviously the encoding you used to store the source file should match the compiler's source charset setting. The compiler literally reads your source file and interprets its contents based on the configured encoding.

For example, if you saved 'é' in UTF-8 and read it back in ISO-8859-1, you'd see 'Ã©'.

But if you saved 'é' in ISO-8859-1 (a single 0xE9 byte) and read it back as UTF-8, you'd get either an invalid-encoding error or a fallback to some other encoding, because 0xE9 by itself is not a valid UTF-8 sequence.
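
To make the first mismatch concrete, here is a tiny illustration (my own sketch): 'é' saved as UTF-8 is the two bytes 0xC3 0xA9, and an ISO-8859-1 decoder maps every byte directly to the codepoint with the same value, which is why you see 'Ã©'.

#include <cstdio>

int main() {
    const unsigned char utf8_e_acute[] = { 0xC3, 0xA9 };  // 'é' encoded as UTF-8
    for (unsigned char b : utf8_e_acute)
        std::printf("byte 0x%02X -> U+%04X\n", (unsigned)b, (unsigned)b);  // ISO-8859-1: byte value == codepoint
    // prints: byte 0xC3 -> U+00C3 ('Ã'), byte 0xA9 -> U+00A9 ('©')
}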

It depends on which non-ASCII characters you use in your source files. If they're all Latin-1, it's simplest to store the source in Windows-1252 (or whatever the default code page is for your locale), because MSVC defaults the source charset to the active code page when no BOM is present. Then you don't need to specify any /source-charset.

If you use more than just Latin characters, or you want maximum portability, the best option is to save the source as UTF-8 and pass the /utf-8 flag to cl.exe, which is shorthand for /source-charset:utf-8 /execution-charset:utf-8.
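
A minimal sanity check for that setup could look like this (my own sketch; the file name and build line are just examples): save the file as UTF-8, build with /utf-8, and the narrow literal ends up holding the same UTF-8 bytes as the u8 literal.

// check.cpp, saved as UTF-8, built with e.g.:  cl /utf-8 /EHsc check.cpp
#include <cassert>
#include <cstring>

int main() {
    // With /execution-charset:utf-8 (implied by /utf-8) the narrow literal is UTF-8 too.
    assert(std::strcmp("é", reinterpret_cast<const char*>(u8"é")) == 0);
}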

Upvotes: 2
