Patrick Storz

Reputation: 495

Avoid / set character set conversion / encoding for std::cout / std::cerr

General question

Is there a possibility to avoid character set conversion when writing to std::cout / std::cerr? I do something like

std::cout << "Ȋ'ɱ ȁ ȖȚƑ-8 Șțȓȉɳɠ (in UTF-8 encoding)" << std::endl;

And I want the output to be written to the console with its UTF-8 encoding intact (my console uses UTF-8, but my C++ Standard Library, GNU's libstdc++, doesn't think so for some reason).

If the conversion cannot be disabled: can I set std::cout to use UTF-8, so that it hopefully figures out by itself that no conversion is needed?


Background

I used the Windows API function SetConsoleOutputCP(CP_UTF8); to set my console's encoding to UTF-8. The problem seems to be that UTF-8 does not match the code page typically used for my system's locale, so libstdc++ sets up std::cout with the default ANSI code page instead of recognizing the switch.
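For reference, the console switch described above could be wrapped so that the previous code page is restored on exit. This is only a minimal sketch: the class name Utf8ConsoleGuard and its changed() member are made up for illustration, and the non-Windows branch simply assumes there is nothing to switch.

```cpp
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical RAII helper: switches the console output code page to UTF-8
// for its lifetime and restores the previous one afterwards.
class Utf8ConsoleGuard {
public:
    Utf8ConsoleGuard() {
#ifdef _WIN32
        previous_ = GetConsoleOutputCP();
        changed_ = SetConsoleOutputCP(CP_UTF8) != 0;
#endif
    }

    ~Utf8ConsoleGuard() {
#ifdef _WIN32
        if (changed_) {
            SetConsoleOutputCP(previous_);
        }
#endif
    }

    Utf8ConsoleGuard(const Utf8ConsoleGuard&) = delete;
    Utf8ConsoleGuard& operator=(const Utf8ConsoleGuard&) = delete;

    // True only if the code page was actually switched (always false
    // on non-Windows builds, where this class does nothing).
    bool changed() const { return changed_; }

private:
    unsigned int previous_ = 0;
    bool changed_ = false;
};
```

Creating one instance at the top of main() would cover the whole program run. This only changes how the console interprets the bytes it receives; it does not affect any conversion done by a library before the bytes are written.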



Edit: Turns out I misinterpreted the issue and the solution is actually a lot simpler (or not...).

The "Ȋ'ɱ ȁ ȖȚƑ-8 Șțȓȉɳɠ (in UTF-8 encoding)" was just meant as a placeholder (and I shouldn't have used it, as it hid the actual issue).

In my real code the "UTF-8 string" is a Glib::ustring, and those are UTF-8 encoded by definition. However, I did not realize that the output operator << is defined in glibmm in a way that forces character set conversion.
It uses g_locale_from_utf8() internally, which in turn uses g_get_charset() to determine the target encoding.

Unfortunately the documentation for g_get_charset() states:

On Windows the character set returned by this function is the so-called system default ANSI code-page. That is the character set used by the "narrow" versions of C library and Win32 functions that handle file names. It might be different from the character set used by the C library's current locale.

which simply means that glib neither honors the C locale I set nor attempts to determine the encoding my console actually uses. That basically makes it impossible to use many glib functions to produce UTF-8 output. (In fact, this issue has the exact same cause as the one that triggered my other question: Force UTF-8 encoding in glib's "g_print()".)

I'm currently considering this a bug in glib (or a serious limitation at best) and will probably open a report in the issue tracker for it.

Upvotes: 2

Views: 529

Answers (1)

Luis Colorado

Reputation: 12668

You are looking at the wrong end of the problem. You are talking about a string literal included in your source code (not input from your keyboard), and for that to work properly you have to tell the compiler which encoding is used for those characters (I think the first C++ standard that mentions non-ASCII character sets is C++11).

Since you are actually using Unicode characters, you would either have to store them in at least a wchar_t for them to be treated as such, or agree with the translator (probably this is what happens) that Unicode characters in string literals are UTF-8 encoded. That usually means they are written out as UTF-8 and, if you use a UTF-8 capable console, they are displayed correctly without any further problem.

I know gcc has options to specify the encoding of a source file and of its string literals (-finput-charset and -fexec-charset), and clang should have something similar. Check the documentation; this will probably solve the issue. For portability, though, the best thing is not to depend on the source code set at all, or to settle on one like ISO 10646 (but keep in mind that UTF-8 is not the character set itself: it is only one way to encode Unicode characters as bytes).
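If you want a literal that does not depend on any of those compiler settings, one option is to spell out the UTF-8 bytes with hex escapes. A small sketch (the names kUAcuteUtf8 and u_acute_utf8 are just for illustration):

```cpp
#include <string>

// UTF-8 bytes for "ú" (U+00FA), written as hex escapes so the result does
// not depend on how the compiler interprets the source file's encoding.
const char kUAcuteUtf8[] = "\xC3\xBA";

// Returns the two UTF-8 bytes as a std::string.
std::string u_acute_utf8() {
    return std::string(kUAcuteUtf8);
}
```

This is less readable than typing the character directly, so it is probably only worth it for a handful of characters or when the build settings are outside your control.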

Another issue is that C++11 does not refer to the Unicode Consortium standard but to its ISO counterpart (ISO 10646, I think). The two are similar but not identical (for example, the ISO code space was originally defined as up to 32 bits wide, while the Unicode Consortium's is 21 bits). These and other differences require some care in C++ and can cause problems when you are thinking in strict Unicode terms.

Of course, to output strings correctly on a UTF-8 terminal, the Unicode code points have to reach the terminal encoded as UTF-8. If the library knows your string is already UTF-8, no conversion is made at all. If it does not know, it will normally assume that every byte is a separate character code from an 8-bit character set and encode each of them to UTF-8 before printing. This leads to double encoding: something like ú (Unicode code point \u00fa) is encoded in UTF-8 as the byte sequence { 0xc3, 0xba }, but if the string is not known to be UTF-8 already, those two bytes are handled as the codes of the two characters Ã (\u00c3) and º (\u00ba) and re-encoded as { 0xc3, 0x83, 0xc2, 0xba }, which displays incorrectly. This is a very common error, and you have probably seen it wherever some encoding step was done incorrectly. Source for the samples here.
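The double encoding described above can be reproduced in a few lines. This is only a sketch of the effect, assuming the "wrong" converter treats its input as Latin-1; latin1_to_utf8 is a made-up name, not a function from any library mentioned here.

```cpp
#include <string>

// Encodes every input byte as if it were a Latin-1 character
// (code points U+0000..U+00FF) into UTF-8.
std::string latin1_to_utf8(const std::string& in) {
    std::string out;
    for (unsigned char c : in) {
        if (c < 0x80) {
            // ASCII range: one byte, unchanged.
            out.push_back(static_cast<char>(c));
        } else {
            // U+0080..U+00FF: two bytes, 110xxxxx 10xxxxxx.
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return out;
}
```

Feeding it the single Latin-1 byte 0xfa gives the correct { 0xc3, 0xba }, while feeding it those two bytes again (that is, text that was already UTF-8) gives { 0xc3, 0x83, 0xc2, 0xba }.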

Upvotes: 1
