Jimm Chen

Reputation: 3797

Does Windows produce a UTF8 sequence for the ANSI version of WM_CHAR? Why can't I see it?

As of today (2023.01), the MSDN page https://learn.microsoft.com/en-us/windows/win32/inputdev/wm-char says:

... Otherwise(using ANSI version of RegisterClass), the system provides characters in the current process code page, which can be set to UTF-8 in Windows Version 1903 (May 2019 Update) and newer.

But I just can NOT see WM_CHAR delivering a Unicode character as a UTF8 sequence. Am I doing something wrong, or is the documentation wrong/misleading?

I ran the experiments on Win10.21H2 using Keyview2A.exe v1.8, which is based on Charles Petzold's Keyview2 demo program from his famous book Programming Windows, 5th ed. (1998).

First, the non-UTF8ACP case, to show that KeyviewA works OK.

I type the Chinese character 电, which is U+7535, with GBK encoding B5 E7.

non-UTF8ACP, type in a non-ASCII char

non-UTF8ACP, OK we get ANSI sequence in codepage 936

Second, in the UTF8ACP case, KeyviewA does NOT receive a UTF8 sequence.

I just get 0x3F (?), sigh!

UTF8ACP, type in a Unicode char

UTF8ACP, does not see UTF8 sequence in WM_CHAR
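For reference, this is the byte sequence I was hoping to see for U+7535. The snippet below is only a minimal sketch of a standard UTF-8 encoder in portable C; the function name `utf8_encode` is my own for illustration, and nothing in it is a Windows API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode scalar value as UTF-8.
 * Writes up to 4 bytes into out and returns the byte count,
 * or 0 if cp is a surrogate or lies beyond U+10FFFF.
 * (utf8_encode is an illustrative name, not a system function.) */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {                       /* 1 byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                      /* surrogates are not scalar values */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 4 bytes: 11110xxx + 3 continuation bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                              /* out of Unicode range */
}
```

Calling `utf8_encode(0x7535, buf)` fills `buf` with E7 94 B5, which is neither the GBK pair B5 E7 nor the 0x3F that KeyviewA actually receives.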

Third, what about characters from an SBCS?

SBCS = Single-byte character set. DBCS = Double-byte character set. MBCS = Multi-byte character set (generic name for SBCS, DBCS and 3+ byte character sets).

Most European countries use such character sets.

Type in some Russian letters:

Type in some Russian letters.

Type in some Greek letters:

Type in some Greek letters

[20230121.c1] So far, I seem to have found the rule behind "enabling UTF8ACP" for an ANSI (narrow-char) program. It is summarized below:

The IME produces a Unicode value for any human-input character. When Windows needs to send that character to KeyviewA, it does the following:

Upvotes: 2

Views: 504

Answers (1)

Jimm Chen

Reputation: 3797

Looks like I need to answer my own question after some investigation. The answer is something that you and I cannot learn by merely reading MSDN.

I did see UTF8 in ANSI WM_CHAR, but in a surprising way.

  • First, turn on UTF8ACP on Windows 10.
  • Second, add a Tibetan keyboard layout.
  • Third, run Keyview2A v1.9 (ANSI version), which I have just updated to handle this very case.

Now, type some Tibetan characters into Keyview2A, and we see a UTF8 sequence appear.

Add Tibetan keyboard layout

Type some Tibetan characters into Keyview2A

You see? The three UTF8 bytes are sent in a single WM_CHAR message, not spread across three WM_CHAR messages. This is workable because a UTF8 sequence is at most 4 bytes long, so it fits in a WPARAM.

Now compare it with Keyview2U (the Unicode version), whether UTF8ACP is on or off:

Type some Tibetan characters into Keyview2U

OK, U+0F45 is [E0 BD 85] in UTF8, so they match.
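If you want to double-check such a match without doing the bit arithmetic by hand, a small decoder is enough. This is only a sketch in portable C under my own naming (`utf8_decode` is hypothetical, not a Windows function); it rejects truncated, overlong and surrogate sequences.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence from s (len bytes available).
 * Stores the code point in *cp and returns the number of bytes consumed,
 * or 0 if the sequence is truncated or malformed.
 * (utf8_decode is an illustrative name, not a system function.) */
size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *cp)
{
    size_t need;
    uint32_t value, min;

    assert(s != NULL && cp != NULL);
    if (len == 0)
        return 0;

    if (s[0] < 0x80) {                     /* plain ASCII */
        *cp = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {    /* 110xxxxx */
        need = 2; value = s[0] & 0x1F; min = 0x80;
    } else if ((s[0] & 0xF0) == 0xE0) {    /* 1110xxxx */
        need = 3; value = s[0] & 0x0F; min = 0x800;
    } else if ((s[0] & 0xF8) == 0xF0) {    /* 11110xxx */
        need = 4; value = s[0] & 0x07; min = 0x10000;
    } else {
        return 0;                          /* stray continuation or invalid lead byte */
    }

    if (len < need)
        return 0;                          /* truncated sequence */

    for (size_t i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                      /* not a 10xxxxxx continuation byte */
        value = (value << 6) | (s[i] & 0x3F);
    }

    /* Reject overlong forms, surrogates and values beyond U+10FFFF. */
    if (value < min || value > 0x10FFFF || (value >= 0xD800 && value <= 0xDFFF))
        return 0;

    *cp = value;
    return need;
}
```

Feeding it the three bytes E0 BD 85 yields U+0F45 and a length of 3, the same value Keyview2U reports. The GBK pair B5 E7 from my question is rejected, because B5 is a continuation byte and cannot start a UTF-8 sequence.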

Something to mention:

  • If Keyview2A runs in a UTF8ACP-off environment, it still gets 0x3F (?) in WM_CHAR.
  • Why is Tibetan so special? I think it is because the industry has never defined a codepage for Tibetan (let's call it a no-codepage charset). To encode Tibetan text, you have to encode it in Unicode. Besides Tibetan, I think there are Bengali, Gujarati, Tamil etc.
  • Does every no-codepage charset produce its UTF8 sequence in one WM_CHAR message? No, as I found out later! Amharic (spoken in Ethiopia) is an example. It sends the UTF8 sequence as a series of WM_CHAR messages, one byte each. See the image below.

UTF8ACP, typing an Amharic letter

-- What a damn inconsistency!
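Given the two delivery styles above (all bytes packed into one message for Tibetan, one byte per message for Amharic), an ANSI window procedure that wants code points has to cope with both. Below is a rough sketch of such a reassembler in portable C. Everything in it is my own assumption for illustration: the names `Utf8Collector`, `collector_push_byte` and `collector_push_wparam` are hypothetical, a plain `uint32_t` stands in for the WM_CHAR wParam, and since I have not stated above in which order the packed bytes sit inside wParam, the code accepts either order by looking for the lead byte. It skips overlong/surrogate checks to stay short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper state: bytes of the sequence collected so far. */
typedef struct {
    unsigned char buf[4];
    size_t have;   /* bytes collected so far */
    size_t need;   /* total bytes expected for the current sequence */
} Utf8Collector;

/* Expected sequence length for a lead byte, or 0 if b cannot start one. */
static size_t utf8_seq_len(unsigned char b)
{
    if (b < 0x80) return 1;
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    if ((b & 0xF8) == 0xF0) return 4;
    return 0;
}

/* Feed one byte. Returns 1 and stores the code point in *cp when a sequence
 * completes, 0 when more bytes are needed, -1 on malformed input. */
int collector_push_byte(Utf8Collector *c, unsigned char b, uint32_t *cp)
{
    static const unsigned char lead_mask[5] = {0, 0x7F, 0x1F, 0x0F, 0x07};
    uint32_t v;

    assert(c != NULL && cp != NULL);
    if (c->have == 0) {
        c->need = utf8_seq_len(b);
        if (c->need == 0)
            return -1;                 /* continuation byte with no lead byte */
    } else if ((b & 0xC0) != 0x80) {
        c->have = 0;                   /* sequence interrupted; drop it */
        return -1;
    }
    c->buf[c->have++] = b;
    if (c->have < c->need)
        return 0;

    /* Sequence complete: strip the marker bits and join the payload bits. */
    v = c->buf[0] & lead_mask[c->need];
    for (size_t i = 1; i < c->need; i++)
        v = (v << 6) | (c->buf[i] & 0x3F);
    c->have = 0;
    *cp = v;
    return 1;
}

/* Feed one WM_CHAR-style value. A value above 0xFF is treated as several
 * UTF-8 bytes packed together; the byte order is detected, not assumed. */
int collector_push_wparam(Utf8Collector *c, uint32_t wparam, uint32_t *cp)
{
    unsigned char bytes[4];
    size_t n = 0;
    int reverse, result = 0;

    /* Split into significant bytes, lowest-order byte first. */
    do {
        bytes[n++] = (unsigned char)(wparam & 0xFF);
        wparam >>= 8;
    } while (wparam != 0 && n < 4);

    /* If the lowest byte is a continuation byte, the lead byte must sit
     * at the high end, so walk the bytes from high to low instead. */
    reverse = (n > 1 && (bytes[0] & 0xC0) == 0x80);
    for (size_t i = 0; i < n; i++) {
        result = collector_push_byte(c, bytes[reverse ? n - 1 - i : i], cp);
        if (result < 0)
            return result;
    }
    return result;
}
```

In a real window procedure you would keep one zero-initialized `Utf8Collector` per window and call `collector_push_wparam` from the WM_CHAR handler. The order detection works because UTF-8 lead bytes (0xC0 and up) and continuation bytes (0x80 to 0xBF) never overlap.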

A final word for today: don't you think Microsoft's UTF8ACP enhancement to ANSI WM_CHAR is crappy? It enables Keyview2A to see the UTF8 sequence of a no-codepage charset (like Tibetan), but does NOT enable it to see the UTF8 sequence of a has-codepage charset (as you saw above in my question, Keyview2A gets two 0x3F for a Chinese GBK character) -- genuinely ridiculous.

I really wish Keyview2A could get a UTF8 sequence in every WM_CHAR, which is what most people think UTF8ACP should mean, even though that would break many, many legacy applications (they would receive unexpected byte sequences for non-ASCII characters). No wonder Microsoft still marks the UTF8ACP feature as "Beta", and I guess the Beta status will last for many years, maybe 10~20 years.

Upvotes: 5
