Why can a Windows console with a Chinese code page set show a UTF-16 encoded character?

Question

"For the Microsoft C/C++ compiler, the source and execution character sets are both ASCII."

C++03

2.1 Phases of translation

"..Any source file character not in the basic source character set (2.2) is replaced by the universal-character-name that designates that character. (An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (i.e. using the \uXXXX notation), are handled equivalently.)"

2.13.2 Character literals

"A universal-character-name is translated to the encoding, in the execution character set, of the character named. If there is no such encoding, the universal-character-name is translated to an implementation-defined encoding."

To test which execution character set is used by MSVC++, I wrote the following code:

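const wchar_t *str = L"中";
// View the wide string as raw bytes to inspect its in-memory encoding
const unsigned char *p = reinterpret_cast<const unsigned char *>(str);
for (size_t i = 0; i < sizeof(L"中"); ++i)
{
   printf ("%x ", *(p + i));
}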

The output is 2d 4e 0 0, and 0x4e2d is the UTF-16 encoding of this Chinese character (the bytes appear in little-endian order, followed by the two-byte null terminator). So I conclude that UTF-16 is used as the execution character set by MSVC (my version: 2012 4.5.50709).

Next, I tried to print this character to a Windows console. Since the default locale used by the console is "C", I set the locale to code page 936, which represents simplified Chinese characters.

// use the execution environment locale setting, which is 936
wchar_t *str = L"中";
char* locale = setlocale(LC_ALL, "");
wprintf (L"%ls\n", str);

Which outputs:

中

What I'm curious about is this: how can a character encoded in UTF-16 be decoded by a Windows console whose locale (decoder) is set to something other than UTF-16 (MS code page 936)?

Eric Z · Accepted Answer

I think I get it.

In Microsoft C++ 2008 (probably 2005+), CRT output facilities such as wprintf and wcout are implemented so that, under the hood, they convert a wide string such as L"中", encoded in UTF-16, to match the current locale/code page setting. So what happens here is that L"中" is converted to the bytes D6 D0, its encoding in code page 936 for simplified Chinese.
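
The conversion described above can be imitated with the standard wcstombs function, which also encodes wide characters according to the current C locale. This is only a minimal sketch of the idea, not the CRT's actual implementation of wprintf; the helper name narrow_in_current_locale is hypothetical.

```cpp
#include <cassert>
#include <clocale>
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical helper (not a CRT function): converts a wide string to the
// multibyte encoding of the current C locale, the same kind of conversion the
// answer attributes to wprintf. Returns false if a character cannot be encoded.
bool narrow_in_current_locale(const wchar_t *wide, std::string &out)
{
    assert(wide != nullptr);

    // With a null destination, wcstombs only reports the number of bytes needed.
    const std::size_t needed = std::wcstombs(nullptr, wide, 0);
    if (needed == static_cast<std::size_t>(-1))
        return false; // some character has no encoding in this locale

    std::vector<char> buffer(needed + 1); // +1 for the terminating null
    std::wcstombs(buffer.data(), wide, buffer.size());
    out.assign(buffer.data(), needed);
    return true;
}
```

Under the default "C" locale the helper fails for L"中", because that locale has no encoding for the character. After setlocale(LC_ALL, "") on a system whose code page is 936, it would be expected to yield the two bytes D6 D0 mentioned above. MSVC may flag wcstombs with its deprecation warning in favor of wcstombs_s.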

I was wrong to think that setlocale sets the console code page. It only sets the current program's code page, which the CRT functions use during the "conversion". To change the console code page, use the chcp command or the Win API function SetConsoleOutputCP().
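
To see that these are two separate settings, a program can print both. The sketch below assumes a Windows build for the console part and falls back to 0 elsewhere; the helper names console_output_code_page and report_locale_and_console are hypothetical.

```cpp
#include <clocale>
#include <cstdio>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: returns the console's output code page on Windows,
// or 0 on non-Windows builds where the concept does not apply.
unsigned int console_output_code_page()
{
#ifdef _WIN32
    return GetConsoleOutputCP(); // the value chcp reports; not set by setlocale
#else
    return 0;
#endif
}

// Hypothetical helper: prints the program locale and the console code page.
void report_locale_and_console()
{
    // Passing a null pointer queries the current locale without changing it.
    const char *program_locale = std::setlocale(LC_ALL, nullptr);
    std::printf("CRT locale: %s\n", program_locale ? program_locale : "(unknown)");
    std::printf("console output code page: %u\n", console_output_code_page());
}
```

Calling report_locale_and_console() before and after setlocale(LC_ALL, "") should show the CRT locale changing while the console code page stays the same. Calling SetConsoleOutputCP(936) would change only the console side.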

Since my console's default code page is 936, the character is displayed correctly without any problem.
