Zhro

Reputation: 2614

Different outputs when using C++11 \u vs \x to output a Unicode string?

This is a simple program which should output four Unicode glyphs. The four glyphs are composed from five code points, or 14 bytes in straight UTF-8.

My impression is that the output for these should be the same; one is simply a list of codepoints and the other is the UTF-8 encoded form of the same list.

Note that some of these symbols may not be visible in your console. The horse head (equid) is expected to be invisible, as it is most likely unsupported by any font installed on your system.

Note that the question is specifically about why the output is different; it seems as though the equid character is the problem.

You can also compile and run it here using gcc-5.1: https://ideone.com/Q31D9x

#include <iostream>

using namespace std;

int main() {
   cout << "\x61\xE0\xA4\xA8\xE0\xA4\xBF\xE4\xBA\x9C\xF0\x90\x82\x83" << endl;
   cout << "\u0061\u0928\u093F\u4E9C\u10083" << endl;

   return 0;
}

Original image source: http://unicode.org/faq/char_combmark.html

Update

The corrected code is:

#include <iostream>

using namespace std;

int main() {
   cout << u8"\x61\xE0\xA4\xA8\xE0\xA4\xBF\xE4\xBA\x9C\xF0\x90\x82\x83" << endl;
   cout << u8"\u0061\u0928\u093F\u4E9C\U00010083" << endl;

   return 0;
}

Upvotes: 1

Views: 912

Answers (2)

Felix Dombek

Reputation: 14372

The parser must parse \u10083 as \u1008, one Unicode code point in the Basic Multilingual Plane, followed by the ordinary character 3. What exactly the resulting representation will be depends on the type of your string (e.g., L"", u8"", u"", U""). For a string with no such prefix, the exact representation is implementation-defined.

For code points outside of the BMP, there is the \U00010083 notation.

Upvotes: 4

Oleg Andriyanov

Reputation: 5279

Although Felix Dombek has given an answer, I'd like to explain string literals in C++11 a little bit.

\u is not for UTF-16 or any other encoding. Escape sequences like \u and \U are encoding-agnostic. They only specify the code point, i.e. the character's number in the big Unicode table of characters. This means you cannot tell, from the escape alone, the exact sequence of bytes which represents the string "\u5678".

The thing which specifies the encoding of a string literal is a prefix like u"blabla". In that case the standard guarantees that the string will be encoded in UTF-16. One has to distinguish the purposes of string-literal prefixes from Unicode escape sequences: the former specify the encoding, while the latter specify the actual characters (which can be represented in any number of encodings).

References: http://en.cppreference.com/w/cpp/language/string_literal, http://en.cppreference.com/w/cpp/language/escape

Upvotes: 4
