Zhro

Reputation: 2614

Different outputs when using C++11 \u vs \x to output a Unicode string?

This is a simple program which should output four Unicode glyphs. The four glyphs are composed from five code points, or 14 bytes in straight UTF-8.

My impression is that the output for these should be the same; one is simply a list of codepoints and the other is the UTF-8 encoded form of the same list.

Note that some of these symbols may not be visible in your console. The horse head (equid) is expected to be invisible, as it is most likely unsupported by any font installed on your system.

Note that the question is specifically about why the output is different; it seems as though the equid character is the problem.

You can also compile and run it here using gcc-5.1: https://ideone.com/Q31D9x

#include <iostream>

using namespace std;

int main() {
   cout << "\x61\xE0\xA4\xA8\xE0\xA4\xBF\xE4\xBA\x9C\xF0\x90\x82\x83" << endl;
   cout << "\u0061\u0928\u093F\u4E9C\u10083" << endl;

   return 0;
}

Original image source: http://unicode.org/faq/char_combmark.html

Update

The corrected code is:

#include <iostream>

using namespace std;

int main() {
   cout << u8"\x61\xE0\xA4\xA8\xE0\xA4\xBF\xE4\xBA\x9C\xF0\x90\x82\x83" << endl;
   cout << u8"\u0061\u0928\u093F\u4E9C\U00010083" << endl;

   return 0;
}

Upvotes: 1

Views: 912

Answers (2)

Felix Dombek

Reputation: 14372

The parser must parse \u10083 as \u1008, one Unicode code point in the Basic Multilingual Plane, followed by the ordinary character 3. What exactly the resulting representation will be depends on the type of your string (e.g., L"", u8"", u"", U""). For a string with no such prefix, the exact representation is implementation-defined.

For code points outside of the BMP, there is the \U00010083 notation.

Upvotes: 4

Oleg Andriyanov

Reputation: 5279

Although Felix Dombek has given an answer, I'd like to explain string literals in C++11 a little bit.

\u is not for UTF-16 or any other encoding. Escape sequences like \u and \U are encoding-agnostic. They only specify the code point, i.e. the character's number in the big Unicode table of characters. This means you cannot tell, from the escape alone, the exact sequence of bytes which represents the string "\u5678".

The thing which specifies the encoding of a string literal is a prefix like u"blabla". In that case the standard guarantees that the string will be encoded in UTF-16. One has to distinguish the purposes of string-literal prefixes from Unicode escape sequences: the former specify the encoding, while the latter specify the actual characters (which can be represented in any number of encodings).

References: http://en.cppreference.com/w/cpp/language/string_literal, http://en.cppreference.com/w/cpp/language/escape

Upvotes: 4
