Reputation: 14525
Per MSDN:
"For the Microsoft C/C++ compiler, the source and execution character sets are both ASCII."
C++03
2.1 Phases of translation
"..Any source file character not in the basic source character set (2.2) is replaced by the universal-character-name that designates that character. (An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (i.e. using the \uXXXX notation), are handled equivalently.)"
2.13.2 Character literals
"A universal-character-name is translated to the encoding, in the execution character set, of the character named. If there is no such encoding, the universal-character-name is translated to an implementation-defined encoding."
To test which execution character set is used by MSVC++, I wrote the following code:
const wchar_t *str = L"中";
const unsigned char *p = reinterpret_cast<const unsigned char*>(str);
// dump every byte of the literal, including the terminating L'\0'
for (size_t i = 0; i < sizeof(L"中"); ++i)
{
    printf("%x ", *(p + i));
}
The output is 2d 4e 0 0, and 0x4e2d is the UTF-16 encoding of this Chinese character (the bytes appear in little-endian order). So I conclude that MSVC uses UTF-16 as the execution character set for wide literals (my version: 2012 4.5.50709).
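To see why the bytes come out as 2d 4e, it may help to compute the UTF-16 little-endian byte sequence by hand. The following is a portable sketch that does not depend on MSVC; the function name utf16le_bytes is made up for illustration, and it assumes the input is a valid code point (not a surrogate, at most U+10FFFF).

```cpp
#include <cstdint>
#include <vector>

// Encode one Unicode code point as UTF-16LE bytes (hypothetical helper).
// Assumes cp is a valid scalar value (no surrogates, at most 0x10FFFF).
std::vector<unsigned char> utf16le_bytes(std::uint32_t cp)
{
    std::vector<std::uint16_t> units;
    if (cp < 0x10000) {
        // Basic Multilingual Plane: a single 16-bit code unit.
        units.push_back(static_cast<std::uint16_t>(cp));
    } else {
        // Supplementary plane: split into a surrogate pair.
        cp -= 0x10000;
        units.push_back(static_cast<std::uint16_t>(0xD800 + (cp >> 10)));
        units.push_back(static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF)));
    }

    std::vector<unsigned char> bytes;
    for (std::uint16_t u : units) {
        bytes.push_back(static_cast<unsigned char>(u & 0xFF)); // low byte first
        bytes.push_back(static_cast<unsigned char>(u >> 8));   // then high byte
    }
    return bytes;
}
```

Calling utf16le_bytes(0x4E2D) gives the two bytes 2d 4e, which match the first two bytes of the dump; the trailing 0 0 in the dump is the terminating null wide character.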
Next, I tried to print this character to a Windows console. Since the default locale used by the console is "C", I set the locale to code page 936, which covers Simplified Chinese characters.
// use the execution environment's locale setting, which is code page 936 here
const wchar_t *str = L"中";
char* locale = setlocale(LC_ALL, "");
wprintf(L"%ls\n", str);
Which outputs:
中
What I'm curious about is: how can a character encoded in UTF-16 be decoded by a Windows console whose locale (decoder) is set to something other than UTF-16 (MS code page 936)?
Upvotes: 4
Views: 1858
Reputation: 14525
I think I get it.
In Microsoft C++ 2008 (probably 2005+), CRT facilities such as wprintf and wcout are implemented so that, under the hood, they convert a wide string literal such as L"中", encoded in UTF-16, to match the current locale/code page setting. So what happens here is that L"中" is converted to the bytes D6 D0 in code page 936 for Simplified Chinese.
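The conversion described here can be approximated with the standard wcstombs function, which encodes a wide string using the current C locale. This is only a sketch of the idea, not the CRT's actual internals, and to_locale_bytes is a hypothetical name.

```cpp
#include <clocale>
#include <cstdlib>
#include <cwchar>
#include <string>
#include <vector>

// Hypothetical helper: convert a wide string to the multibyte encoding of the
// current C locale. Returns false if some character has no encoding there.
bool to_locale_bytes(const wchar_t* ws, std::string& out)
{
    // Worst case: every wide character needs MB_CUR_MAX bytes, plus the terminator.
    std::vector<char> buf(std::wcslen(ws) * MB_CUR_MAX + 1);

    // wcstombs returns the number of bytes written (excluding the terminator),
    // or (size_t)-1 when a character cannot be represented in the locale.
    std::size_t n = std::wcstombs(buf.data(), ws, buf.size());
    if (n == static_cast<std::size_t>(-1))
        return false;

    out.assign(buf.data(), n);
    return true;
}
```

After setlocale(LC_ALL, "") on a system whose ANSI code page is 936, passing L"中" should produce the two bytes D6 D0. In the default "C" locale the same call is expected to fail, because that character has no encoding there.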
I was wrong to think that setlocale sets the console code page. It only sets the program's current code page, which the CRT functions use during the "conversion". To change the console code page, use the chcp command or the Win32 API SetConsoleOutputCP().
Since my console's default code page is 936, the character is displayed correctly without any problem.
Upvotes: 1
Reputation: 536615
how can a character encoded in UTF-16 be decoded by a Windows console whose locale(decoder) is set to non-UTF-16
There are two ways you can write text to the console. The byte way, using the Win32 API WriteConsoleA, gives you characters from bytes interpreted using the console's code page ("ANSI"). The Unicode way, WriteConsoleW, receives a UTF-16LE string and writes the characters to the console directly without having to worry about what code page it is using.
The stdio function printf uses WriteConsoleA when the output is an interactive console. The wprintf function, from VS 2005 on at least, calls WriteConsoleW.
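A minimal sketch of the Unicode path, calling WriteConsoleW directly. It only does real work on Windows with stdout attached to a console; the wrapper name write_wide_to_console is hypothetical, and on other platforms the function simply reports failure.

```cpp
#include <cwchar>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical wrapper: write a UTF-16 string straight to the console,
// bypassing any code page conversion. Returns false if that is not possible.
bool write_wide_to_console(const wchar_t* text)
{
#ifdef _WIN32
    HANDLE out = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD mode = 0;
    // GetConsoleMode fails when stdout is redirected to a file or pipe,
    // in which case WriteConsoleW cannot be used.
    if (out == INVALID_HANDLE_VALUE || !GetConsoleMode(out, &mode))
        return false;

    DWORD written = 0;
    return WriteConsoleW(out, text, static_cast<DWORD>(std::wcslen(text)),
                         &written, nullptr) != 0;
#else
    (void)text; // no Win32 console on this platform
    return false;
#endif
}
```

Because the string is handed over as UTF-16, the result should not depend on the console's code page, although the console font still needs a glyph for the character.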
Upvotes: 1