Reputation: 7506
Here's what I know about VC++ /source-charset and /execution-charset (https://learn.microsoft.com/en-us/cpp/build/reference/source-charset-set-source-character-set).
So there are 3 things I need to keep consistent (if anything is wrong, please correct me): the encoding I save the source file with, /source-charset, and /execution-charset.

If I save the source file with encodingA, set /source-charset and /execution-charset to encodingA, and have code wchar_t c = L'é'; or char16_t c = u'é'; or char32_t c = U'é';, will the code unit of é in the program change depending on which encodingA I choose during this "interpreting"? Or would é's code unit never change no matter what encoding I choose? (Don't worry about the console output.)
Upvotes: 2
Views: 588
Reputation: 596417
/source-charset dictates how Unicode is stored as bytes in your source file on disk, nothing more. The code editor knows é is Unicode codepoint U+00E9 and will encode it to the file accordingly (0xE9 in Latin-1, 0xC3 0xA9 in UTF-8, etc).
When the compiler then reads the source file, it converts the file's bytes to Unicode using the specified /source-charset, and then processes the Unicode data as needed. At this stage, provided the correct /source-charset is used so the file's bytes are decoded properly, the é is read back in as Unicode codepoint U+00E9, and is not handled in any particular encoding until the next step.
The /execution-charset dictates what encoding Unicode data is saved as in the executable if no other encoding is specified in the code. It does not apply in your examples, because the L/u/U prefixes dictate the encoding (L = UTF-16 or UTF-32, depending on platform, u = UTF-16, U = UTF-32). So:
wchar_t wc = L'é'; // 0xE9 0x00 or 0xE9 0x00 0x00 0x00
char16_t c16 = u'é'; // 0xE9 0x00
char32_t c32 = U'é'; // 0xE9 0x00 0x00 0x00
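If you want to verify this at runtime, here is a minimal sketch (assuming MSVC on Windows, where wchar_t is 2 bytes) that prints the numeric value and size of each constant:

#include <cstdio>

int main() {
    wchar_t  wc  = L'é';
    char16_t c16 = u'é';
    char32_t c32 = U'é';

    // All three hold the same codepoint value, 0xE9 (U+00E9),
    // just stored in code units of different widths.
    std::printf("wchar_t : 0x%04X (size %zu)\n", (unsigned)wc,  sizeof wc);
    std::printf("char16_t: 0x%04X (size %zu)\n", (unsigned)c16, sizeof c16);
    std::printf("char32_t: 0x%08X (size %zu)\n", (unsigned)c32, sizeof c32);
}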
Were you using char instead, then /execution-charset would apply:
char c = 'é'; // MAYBE 0xE9 or other single-byte value, or a multi-byte overflow warning/error
const char *s = "é"; // MAYBE 0xE9 or other single-byte value, or maybe 0xC3 0xA9
Unless you use the u8 prefix for UTF-8:
char c = u8'é'; // illegal!
const char *s8 = u8"é"; // 0xC3 0xA9
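To see what actually lands in the executable, a small sketch like this can hex-dump both kinds of literal (assuming C++17 or earlier, since in C++20 u8"..." produces const char8_t* instead of const char*):

#include <cstdio>
#include <cstring>

static void dump(const char *label, const char *s) {
    std::printf("%s:", label);
    for (size_t i = 0; i < std::strlen(s); ++i)
        std::printf(" 0x%02X", (unsigned char)s[i]); // raw stored bytes
    std::printf("\n");
}

int main() {
    const char *narrow = "é";   // bytes depend on /execution-charset
    const char *utf8   = u8"é"; // always 0xC3 0xA9
    dump("narrow", narrow);
    dump("u8    ", utf8);
}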
Upvotes: 6
Reputation: 85371
When you write wchar_t c = L'é'; in the source file, it needs to be converted to raw bytes somehow, and the encoding you use when saving the source file will influence the encoding of é.
Obviously, the encoding you used to store the source file should match the compiler's source charset setting. The compiler literally reads your source file and interprets its contents based on the configured encoding.
For example, if you saved 'é' in UTF-8 and read it back in ISO-8859-1, you'd see 'Ã©'. But if you saved 'é' in ISO-8859-1 and read it back in UTF-8, you'd get either a bad-encoding error or a fallback to some other encoding.
It depends on what non-ASCII characters you use in your source files. If it's only Latin-1, then it's best to store the source in Windows-1252 (or whatever the default encoding is for your locale), because MSVC defaults the source charset to that when no BOM is present. Then you won't need to specify any /source-charset.
If you use more than just Latin characters, or you want maximum portability, the best option is to use UTF-8 and pass the /utf-8 flag to cl.exe, which is shorthand for /source-charset:utf-8 /execution-charset:utf-8.
Upvotes: 2