rsk82

Reputation: 29437

How do I read UTF-8 characters from the Windows console? It seems that ReadConsoleOutputCharacter() can't handle them.

Here is code that isolates the problem:

#include <iostream>
#include <windows.h>

using namespace std;

int main() {
  SetConsoleOutputCP(CP_UTF8);
  _wsystem(L"echo pure ascii, naïveté");
  COORD pos = {0,0};

  TCHAR* attempt1 = new TCHAR[14];
  DWORD charnum1;
  ReadConsoleOutputCharacter(GetStdHandle(STD_OUTPUT_HANDLE), attempt1, 14, pos, &charnum1);
  wcout << endl << "charnum1: " << charnum1 << ", attempt1: " << attempt1 << endl;
  wcout << "GetLastError: " << GetLastError();

  TCHAR* attempt2 = new TCHAR[16];
  DWORD charnum2;
  ReadConsoleOutputCharacter(GetStdHandle(STD_OUTPUT_HANDLE), attempt2, 16, pos, &charnum2);
  wcout << endl << "charnum2: " << charnum2 << ", attempt2: " << attempt2 << endl;
  wcout << "GetLastError: " << GetLastError();

  system("pause > nul");
}

The output is:

pure ascii, naïveté

charnum1: 14, attempt1: pure ascii, na
GetLastError: 0
charnum2: 0, attempt2: x >
GetLastError: 0

The first attempt works fine, but when the function tries to read past a position holding a non-ASCII character, it returns nothing and no error is reported. What should I do?

Upvotes: 0

Views: 1166

Answers (1)

Mike C

Reputation: 1238

Caveat: On my system, CP_UTF8 is not available, and so when I run your code the echo command results in "The system cannot write to the specified device."

However, if I remove the SetConsoleOutputCP() call and leave it at the default codepage, 437, I get the string displayed correctly.

Note that the console has separate input and output code pages, set with SetConsoleCP() and SetConsoleOutputCP() respectively. I tried various combinations of 437, 850, 1252, and 28591; the latter two more or less map to Unicode's first 256 code points. If CP_UTF8 is working for you, retry your code with an added call to SetConsoleCP(CP_UTF8).

Note that ReadConsoleOutputCharacter() does not place a null after the last read character, so you've got a problem in your code when you output that TCHAR array: you have no guarantee it's null-terminated and it could crash. (Also, you're not deleting your allocated TCHAR arrays.) So, I changed the allocation lines to this:

TCHAR attempt1[] = L"____________________";  // 20 underscores

which (with no call to SetConsoleOutputCP()) yielded this:

charnum1: 14, attempt1: pure ascii, na______
charnum2: 16, attempt2: pure ascii, na∩v____
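As an alternative to pre-filling the buffer, here is a minimal sketch of building the string from the reported character count, so no terminating null is needed and nothing has to be deleted by hand. To keep it portable, `fake_console_read` is a hypothetical stand-in that only mimics the relevant behaviour of ReadConsoleOutputCharacter() (fills the buffer, reports a count, writes no terminator); `read_cells` is likewise a name invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the console read: copies up to `capacity`
// characters from `screen` into `out`, returns how many were written,
// and does NOT append a terminating null.
std::size_t fake_console_read(const std::wstring& screen,
                              wchar_t* out, std::size_t capacity) {
    std::size_t n = screen.size() < capacity ? screen.size() : capacity;
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = screen[i];
    }
    return n;
}

// Reads up to `length` characters and returns exactly the ones reported.
std::wstring read_cells(const std::wstring& screen, std::size_t length) {
    std::vector<wchar_t> buffer(length);  // released automatically
    std::size_t count =
        fake_console_read(screen, buffer.data(), buffer.size());
    assert(count <= buffer.size());
    // Length-based construction: only `count` characters are used, so
    // uninitialised or stale data after them is never touched.
    return std::wstring(buffer.data(), count);
}
```

With the real API, the same pattern would pass `buffer.data()` and the buffer size to ReadConsoleOutputCharacter() and use the count it writes through its last parameter in place of `count`.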

The glyph just before the "v" in the second line isn't "n"; it's "∩", character 0xEF in code page 437. "ï" is code point 0xEF (U+00EF) in Unicode. What's happening is that the correct code point (0xEF) was read from the console, but the stream output still uses code page 437. Stream output selects its characters based on the stream's locale setting, not on the code page that has been set in the console.
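To make the mismatch in the paragraph above concrete, here is a small sketch that interprets the same byte under two code pages. Both function names are made up for this illustration, and the code page 437 table is deliberately partial: it covers only the two bytes that appear in this thread plus plain ASCII.

```cpp
#include <cstdint>

// ISO-8859-1 (code page 28591): the Unicode code point equals the byte value.
char32_t decode_latin1(std::uint8_t byte) {
    return byte;
}

// Code page 437, partial table for illustration only.
char32_t decode_cp437_partial(std::uint8_t byte) {
    switch (byte) {
        case 0xE9: return 0x0398;  // GREEK CAPITAL LETTER THETA
        case 0xEF: return 0x2229;  // INTERSECTION
        default:
            // ASCII is shared; other high bytes are not covered by this
            // sketch and map to U+FFFD (replacement character).
            return byte < 0x80 ? byte : 0xFFFD;
    }
}
```

Byte 0xEF therefore yields U+00EF ("ï") under the first function and U+2229 ("∩") under the second, which matches the "na∩v" seen when the value read from the console is printed through a stream that still assumes code page 437.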

I don't know why the desired codepoint value is read from the console when the console's READ codepage is still 437. I also am puzzled as to why, if I SetConsoleOutputCP(1252) (or 28591), the output of the echo command looks like it's using CP 437: pure ascii, na∩vitΘ

Upvotes: 2
