Rick

Reputation: 7506

Does the source file encoding or the execution charset change how a wchar_t is stored internally?

I've read the documentation (https://learn.microsoft.com/en-us/cpp/build/reference/source-charset-set-source-character-set) on VC++ /source-charset and /execution-charset.

So there are 3 things I need to keep consistent (if anything is wrong, please correct me):

  1. source file encoding
  2. the /source-charset setting (determines how the compiler interprets the bytes of my source file)
  3. the /execution-charset setting (determines how the compiler encodes the text it decoded in step 2 into the executable); a concrete sketch follows this list
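
For concreteness, here is a minimal sketch of what I mean by keeping the three in sync (the file name and the choice of UTF-8 as encodingA are just placeholders):

// main.cpp, saved on disk as UTF-8 ("encodingA" = UTF-8 in this sketch),
// then built with:
//
//   cl /source-charset:utf-8 /execution-charset:utf-8 main.cpp
//
// so that (1) the bytes on disk, (2) how the compiler decodes them, and
// (3) how literals are encoded into the executable all agree.
int main() {
    wchar_t  wc  = L'é';
    char16_t c16 = u'é';
    char32_t c32 = U'é';
    (void)wc; (void)c16; (void)c32;  // just to silence unused-variable warnings
}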

So, if I save the source file with encodingA, set both /source-charset and /execution-charset to encodingA, and have code like wchar_t c = L'é'; or char16_t c = u'é'; or char32_t c = U'é',

will the code unit(s) stored for é change depending on which encodingA I choose during that "interpreting"?

Or will é's code unit(s) stay the same no matter which encoding I choose?

(Don't worry about console output.)

Upvotes: 2

Views: 588

Answers (2)

Remy Lebeau

Reputation: 596417

The encoding you save your source file with dictates how Unicode is stored as bytes on disk, nothing more. The code editor knows é is Unicode codepoint U+00E9 and will encode it to the file accordingly (0xE9 in Latin-1, 0xC3 0xA9 in UTF-8, etc). /source-charset simply tells the compiler which encoding to expect when it reads those bytes back.

When the compiler then reads the source file, it converts the file's bytes to Unicode using the specified /source-charset, and then processes that Unicode data as needed. At this stage, provided the correct /source-charset is used so the file's bytes are decoded properly, the é is read back in as Unicode codepoint U+00E9, and is not tied to any particular encoding until the next step.

The /execution-charset dictates what encoding Unicode data is saved as in the executable if no other encoding is specified in the code. It does not apply in your examples, because the L/u/U prefixes dictate the encoding (L = UTF-16 or UTF-32, depending on the platform; u = UTF-16; U = UTF-32). So:

wchar_t wc = L'é'; // 0xE9 0x00 or 0xE9 0x00 0x00 0x00

char16_t c16 = u'é'; // 0xE9 0x00

char32_t c32 = U'é'; // 0xE9 0x00 0x00 0x00
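
To see this for yourself, here is a small test program (my own sketch, not part of the original answer): compiled with a matching /source-charset, each literal prints the codepoint value 0xE9, regardless of which /execution-charset you pick.

#include <cstdio>

int main() {
    wchar_t  wc  = L'é';
    char16_t c16 = u'é';
    char32_t c32 = U'é';
    // Each value is codepoint U+00E9 in the prefix-selected encoding.
    std::printf("wchar_t:  0x%04X (%zu bytes)\n", (unsigned)wc,  sizeof wc);
    std::printf("char16_t: 0x%04X (%zu bytes)\n", (unsigned)c16, sizeof c16);
    std::printf("char32_t: 0x%08X (%zu bytes)\n", (unsigned)c32, sizeof c32);
}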

If you were using char instead, then /execution-charset would apply:

char c = 'é';  // MAYBE 0xE9 or other single-byte value, or a multi-byte overflow warning/error

const char *s = "é";  // MAYBE 0xE9 or other single-byte value, or maybe 0xC3 0xA9

Unless you use the u8 prefix for UTF-8:

char c = u8'é'; // illegal! é needs two UTF-8 code units, so it cannot fit in a single u8 character literal

const char *s8 = u8"é";  // 0xC3 0xA9 (note: the type is const char8_t* in C++20)
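
If you want to check what /execution-charset actually produced, a small byte dump like this works (my own sketch; it assumes C++17, where u8"é" is still const char*, while in C++20 it becomes const char8_t* and would need a cast):

#include <cstdio>
#include <cstring>

static void dump(const char *label, const char *s) {
    std::printf("%s:", label);
    for (std::size_t i = 0; i < std::strlen(s); ++i)
        std::printf(" 0x%02X", (unsigned)(unsigned char)s[i]);
    std::printf("\n");
}

int main() {
    const char *s  = "é";   // bytes chosen by /execution-charset
    const char *s8 = u8"é"; // always UTF-8: 0xC3 0xA9
    dump("narrow literal", s);
    dump("u8 literal", s8);
}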

Upvotes: 6

rustyx

Reputation: 85371

When you write wchar_t c = L'é'; in the source file, that text has to be stored as raw bytes somehow, and the encoding you use when saving the source file determines which bytes represent the é.

Obviously the encoding you used to store the source file should match the compiler's source charset setting. The compiler literally reads your source file and interprets its contents based on the configured encoding.

For example, if you saved 'é' in UTF-8 and read it back in ISO-8859-1, you'd see 'Ã©'.

But if you saved 'é' in ISO-8859-1 (a single 0xE9 byte) and read it back as UTF-8, you'd get either an invalid-encoding error or a fallback to some other encoding, because 0xE9 by itself is not a valid UTF-8 sequence.
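
To make the first mismatch concrete, here is a tiny illustration (my own sketch): 'é' saved as UTF-8 is the two bytes 0xC3 0xA9, and an ISO-8859-1 decoder maps every byte directly to the codepoint with the same value, which is why you see 'Ã©'.

#include <cstdio>

int main() {
    const unsigned char utf8_e_acute[] = { 0xC3, 0xA9 };  // 'é' encoded as UTF-8
    for (unsigned char b : utf8_e_acute)
        std::printf("byte 0x%02X -> U+%04X\n", (unsigned)b, (unsigned)b);  // ISO-8859-1: byte value == codepoint
    // prints: byte 0xC3 -> U+00C3 ('Ã'), byte 0xA9 -> U+00A9 ('©')
}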

It depends on which non-ASCII characters you use in your source files. If they're all Latin-1, it's simplest to store the source in Windows-1252 (or whatever the default code page is for your locale), because MSVC defaults the source charset to the active code page when no BOM is present. Then you don't need to specify any /source-charset.

If you use more than just Latin characters, or you want maximum portability, the best option is to save the source as UTF-8 and pass the /utf-8 flag to cl.exe, which is shorthand for /source-charset:utf-8 /execution-charset:utf-8.
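
A minimal sanity check for that setup could look like this (my own sketch; the file name and build line are just examples): save the file as UTF-8, build with /utf-8, and the narrow literal ends up holding the same UTF-8 bytes as the u8 literal.

// check.cpp, saved as UTF-8, built with e.g.:  cl /utf-8 /EHsc check.cpp
#include <cassert>
#include <cstring>

int main() {
    // With /execution-charset:utf-8 (implied by /utf-8) the narrow literal is UTF-8 too.
    assert(std::strcmp("é", reinterpret_cast<const char*>(u8"é")) == 0);
}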

Upvotes: 2
