Reputation: 3797
In today's (2023-01) MSDN page for WM_CHAR, https://learn.microsoft.com/en-us/windows/win32/inputdev/wm-char , Microsoft says:
... Otherwise (using the ANSI version of RegisterClass), the system provides characters in the current process code page, which can be set to UTF-8 in Windows Version 1903 (May 2019 Update) and newer.
But I just can NOT get WM_CHAR to present a Unicode character as a UTF8 sequence. Am I doing something wrong, or is the document wrong/misleading?
I ran the experiments on Win10 21H2, using Keyview2A.exe v1.8, which is based on the Keyview2 demo program from Charles Petzold's famous book Programming Windows, 5th ed. (1998).
I typed the Chinese character 电, which is U+7535 (GBK encoding B5 E7).
I just got 0x3F (?), sigh!
SBCS = single-byte character set. DBCS = double-byte character set. MBCS = multi-byte character set (a generic name covering SBCS, DBCS, and character sets of 3 or more bytes).
Most European countries use a single-byte character set.
Type in some Russian letters:
Type in some Greek letters:
[20230121.c1] So far, I seem to have worked out the rule behind "enabling UTF8ACP" for an ANSI (narrow-char) program. It is summarized below:
The IME produces a Unicode value for any character the user types. When Windows needs to send that character to Keyview2A, it does the following:
1. Get the current keyboard layout with GetKeyboardLayout(0); call it curhkl.
2. Find the ANSI code page that belongs to curhkl. This can be acquired by curcodepage=GetLocaleInfo(LOWORD(curhkl), LOCALE_IDEFAULTANSICODEPAGE, ...);
3. Call WideCharToMultiByte(curcodepage, ...) to convert the Unicode value to an MBCS sequence.
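The steps above can be sketched roughly as follows. This is only my guess at the conversion, not Windows' actual implementation; the helper name guess_ansi_wm_char_bytes is made up for illustration, and the code only does real work when built for Windows (elsewhere it is a stub that fails).

```c
#include <assert.h>
#include <stddef.h>

#ifdef _WIN32
#include <windows.h>

/* Hypothetical helper (not part of Keyview2A): guess the bytes an ANSI
   window would receive for one UTF-16 code unit typed under the current
   keyboard layout. Returns the byte count, or -1 on failure. */
int guess_ansi_wm_char_bytes(wchar_t wc, char *out, int outsize)
{
    assert(out != NULL && outsize > 0);

    /* Step 1: keyboard layout of the calling thread. */
    HKL curhkl = GetKeyboardLayout(0);

    /* Step 2: ANSI code page tied to the layout's language ID.
       LOCALE_RETURN_NUMBER asks for the value as a DWORD, not a string. */
    DWORD curcodepage = 0;
    LCID lcid = MAKELCID(LOWORD((DWORD_PTR)curhkl), SORT_DEFAULT);
    if (!GetLocaleInfoW(lcid,
                        LOCALE_IDEFAULTANSICODEPAGE | LOCALE_RETURN_NUMBER,
                        (LPWSTR)&curcodepage,
                        sizeof(curcodepage) / sizeof(WCHAR)))
        return -1;

    /* Step 3: Unicode -> MBCS. With flags 0, a character the code page
       cannot represent is replaced by its default character,
       usually '?' (0x3F). */
    int n = WideCharToMultiByte(curcodepage, 0, &wc, 1,
                                out, outsize, NULL, NULL);
    return n > 0 ? n : -1;
}
#else
/* Non-Windows stub so the file still compiles; it always fails. */
int guess_ansi_wm_char_bytes(wchar_t wc, char *out, int outsize)
{
    assert(out != NULL && outsize > 0);
    (void)wc;
    return -1;
}
#endif
```

Calling guess_ansi_wm_char_bytes(0x7535, buf, sizeof buf) while a given keyboard layout is active should show which bytes that layout's code page produces for 电, which can then be compared with what Keyview2A displays.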
Upvotes: 2
Views: 504
Reputation: 3797
Looks like I need to answer my own question after some investigation. The answer is something that neither you nor I could learn by merely reading MSDN.
I did see UTF8 in ANSI WM_CHAR, but in a surprising way.
Now, type some Tibetan characters into Keyview2A, and we see a UTF8 sequence appear.
You see? The three UTF8 bytes are sent in a single WM_CHAR message, not across three WM_CHAR messages. This works because a UTF8 sequence is at most 4 bytes long, which fits in a WPARAM.
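To illustrate how a whole UTF8 sequence fits in one message parameter, here is a minimal sketch that packs up to 4 bytes into a 32-bit value and splits them out again. The helper names are made up, and the byte order (first byte in the lowest byte) is an assumption for illustration only; check the order your own system delivers before relying on it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack up to 4 bytes into one 32-bit value, first byte in the lowest
   byte. ASSUMPTION: this byte order is chosen for illustration; it is
   not confirmed to be the order Windows uses in wParam. */
uint32_t pack_bytes(const unsigned char *bytes, int len)
{
    assert(bytes != NULL && len >= 1 && len <= 4);
    uint32_t packed = 0;
    for (int i = 0; i < len; i++)
        packed |= (uint32_t)bytes[i] << (8 * i);
    return packed;
}

/* Reverse of pack_bytes: split the value into bytes, lowest byte first,
   stopping at the first zero byte (a multi-byte UTF8 sequence never
   contains a zero byte). Returns the number of bytes written. */
int unpack_bytes(uint32_t packed, unsigned char out[4])
{
    assert(out != NULL);
    int len = 0;
    while (len < 4 && (packed & 0xFF) != 0) {
        out[len++] = (unsigned char)(packed & 0xFF);
        packed >>= 8;
    }
    return len;
}
```

Under that assumed order, the three bytes E0 BD 85 pack into the single value 0x0085BDE0, and unpack_bytes recovers them in the original order.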
Now compare it with Keyview2U (the Unicode version), which behaves the same whether UTF8ACP is on or off:
OK, U+0F45 is UTF8 [E0 BD 85], they match.
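For anyone who wants to double-check the match by hand, here is a small self-contained UTF8 encoder. It is a plain textbook sketch, not code from Keyview2; the function name utf8_encode is my own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF8. Returns the byte count (1-4),
   or 0 if cp is a surrogate or lies beyond U+10FFFF. */
int utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {                        /* 1 byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                       /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                     /* 3 bytes: 1110xxxx + 2 tails */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                       /* surrogates are not encodable */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                   /* 4 bytes: 11110xxx + 3 tails */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

utf8_encode(0x0F45, buf) yields the three bytes E0 BD 85, the same sequence Keyview2A shows. For comparison, utf8_encode(0x7535, buf) yields E7 94 B5, which is what the Chinese character from my question would look like in UTF8.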
Something to mention:
?) in WM_CHAR. -- What a damn inconsistency!
Final word for today: don't you think Microsoft's UTF8ACP enhancement to ANSI WM_CHAR is crappy? It lets Keyview2A see the UTF8 sequence for a charset that has no code page (like Tibetan), but does NOT let it see the UTF8 sequence for charsets that do have a code page (as you saw above in my question, Keyview2A gets two 0x3F for a Chinese GBK character) -- genuinely ridiculous.
I really wish Keyview2A could get a UTF8 sequence for every WM_CHAR, which is what most people think UTF8ACP should mean. But that would break many, many legacy applications, which would receive unexpected byte sequences for non-ASCII characters. No wonder Microsoft still marks the UTF8ACP feature as "Beta", and I guess the Beta status will keep going for many years, maybe 10~20 years.
Upvotes: 5