Eric Z

Reputation: 14525

Why can a Windows console set to a Chinese code page show a UTF-16 encoded character?

Per MSDN:

"For the Microsoft C/C++ compiler, the source and execution character sets are both ASCII."

C++03

2.1 Phases of translation

"..Any source file character not in the basic source character set (2.2) is replaced by the universal-character-name that designates that character. (An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (i.e. using the \uXXXX notation), are handled equivalently.)"

2.13.2 Character literals

"A universal-character-name is translated to the encoding, in the execution character set, of the character named. If there is no such encoding, the universal-character-name is translated to an implementation-defined encoding."

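To test which execution character set MSVC++ uses, I wrote the following code:

// requires <cstdio>
const wchar_t *str = L"中";   // string literals are const
const unsigned char *p = reinterpret_cast<const unsigned char*>(str);
// sizeof covers every byte of the literal, including the terminating L'\0'
for (size_t i = 0; i < sizeof(L"中"); ++i)
{
   printf("%x ", p[i]);
}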

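The output is 2d 4e 0 0. 0x4e2d is the UTF-16 encoding of this Chinese character, stored here in little-endian byte order and followed by the two zero bytes of the terminator. So I conclude that MSVC (my version: 2012, 4.5.50709) uses UTF-16 as the execution character set for wide literals.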
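As a cross-check that does not depend on the width of wchar_t, here is a minimal sketch that assumes a C++11 compiler. A char16_t literal is UTF-16 by definition, and the universal-character-name \u4e2d names the same character as the one typed directly in the source. The helper name utf16_units_hex is hypothetical, chosen only for this illustration.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Returns the UTF-16 code units of a char16_t string as space-separated
// four-digit hex values. (Hypothetical helper, for illustration only.)
std::string utf16_units_hex(const std::u16string& s) {
    std::string out;
    char buf[8];
    for (std::size_t i = 0; i < s.size(); ++i) {
        std::snprintf(buf, sizeof buf, "%04x", static_cast<unsigned>(s[i]));
        if (i != 0) out += ' ';
        out += buf;
    }
    return out;
}
```

Calling utf16_units_hex(u"\u4e2d") gives the single code unit 4e2d, which matches the two bytes 2d 4e seen in the byte dump once little-endian storage is taken into account.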

Next, I tried to print this character to a Windows console. Since the default locale is "C", I set the locale to code page 936, which represents Simplified Chinese.

// use the execution environment's locale setting, which is code page 936
const wchar_t *str = L"中";
char *locale = setlocale(LC_ALL, "");
wprintf(L"%ls\n", str);

Which outputs the character 中 correctly.

What I'm curious about is this: how can a character encoded in UTF-16 be decoded by a Windows console whose locale (decoder) is set to something other than UTF-16 (MS code page 936)?

Upvotes: 4

Views: 1858

Answers (2)

Eric Z

Reputation: 14525

I think I get it.

In Microsoft C++ 2008 (probably 2005+), CRT facilities such as wprintf and wcout are implemented so that, under the hood, they convert a wide string literal such as L"中", which is encoded in UTF-16, to match the current locale/code page setting. So what happens here is that L"中" is converted to the bytes D6 D0 in code page 936 for Simplified Chinese.

I was wrong to think that setlocale sets the console code page. It only sets the current program's code page, which the CRT functions use during the "conversion". To change the console code page, use the chcp command or the Win32 API SetConsoleOutputCP().

Since my console's default code page is 936, the character is displayed correctly without any problem.
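The conversion described above can be observed directly with the standard wcstombs function, which encodes a wide string using whatever C locale is currently set. The following is only a sketch of that idea, not the CRT's actual implementation; locale_bytes_hex is a hypothetical helper name.

```cpp
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <cwchar>
#include <string>
#include <vector>

// Converts a wide string to the multibyte encoding of the current C locale
// and returns the resulting bytes as space-separated hex.
// Returns an empty string if some character has no encoding in that locale.
// (Hypothetical helper, for illustration only.)
std::string locale_bytes_hex(const wchar_t* ws) {
    // Each wide character needs at most MB_CUR_MAX bytes, plus a terminator.
    std::vector<char> mb(std::wcslen(ws) * MB_CUR_MAX + 1);
    std::size_t n = std::wcstombs(mb.data(), ws, mb.size());
    if (n == static_cast<std::size_t>(-1)) return std::string();

    std::string out;
    char buf[4];
    for (std::size_t i = 0; i < n; ++i) {
        unsigned byte = static_cast<unsigned char>(mb[i]);
        std::snprintf(buf, sizeof buf, "%02x", byte);
        if (i != 0) out += ' ';
        out += buf;
    }
    return out;
}
```

In the default "C" locale, locale_bytes_hex(L"A") returns 41. On a Windows system where setlocale(LC_ALL, "") selects code page 936, passing L"中" would be expected to return d6 d0, the bytes mentioned above; that part is an assumption based on this answer and has not been run here.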

Upvotes: 1

bobince

Reputation: 536615

how can a character encoded in UTF-16 be decoded by a Windows console whose locale(decoder) is set to non-UTF-16

There are two ways you can write text to the console. The byte way, using the Win32 API WriteConsoleA, gives you characters from bytes interpreted using the console's code page ("ANSI"). The Unicode way, WriteConsoleW, receives a UTF-16LE string and writes the characters to the console directly without having to worry about what code page it is using.

The stdio function printf uses WriteConsoleA when the output is an interactive console. The wprintf function, from VS 2005 on at least, calls WriteConsoleW.
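To illustrate the Unicode path without going through stdio, here is a minimal sketch that calls WriteConsoleW directly. The function name write_console_utf16 is hypothetical. The GetConsoleMode check is an added assumption about how one might detect an interactive console, since WriteConsoleW only works on real console handles; it is not a claim about what the CRT does internally.

```cpp
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Tries to write UTF-16 text straight to the console with WriteConsoleW.
// Returns true only if the text went through the console API; returns false
// on non-Windows builds or when stdout is redirected to a file or pipe.
// (Hypothetical helper, for illustration only.)
bool write_console_utf16(const std::wstring& text) {
#ifdef _WIN32
    HANDLE out = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD mode = 0;
    if (out == INVALID_HANDLE_VALUE || !GetConsoleMode(out, &mode)) {
        return false;  // not an interactive console
    }
    DWORD written = 0;
    return WriteConsoleW(out, text.c_str(),
                         static_cast<DWORD>(text.size()),
                         &written, nullptr) != 0;
#else
    (void)text;  // no console API to call on this platform
    return false;
#endif
}
```

On Windows, write_console_utf16(L"中\n") hands the UTF-16 code units to the console as they are, so the console's code page plays no part in decoding them. Whether the glyph is actually drawn still depends on the console font.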

Upvotes: 1
