Reputation: 2614
This is a simple program which should output the following four Unicode glyphs. The four glyphs are composed from five code points, or 14 bytes in plain UTF-8.
My impression is that the output of the two statements should be the same; one is simply a list of code points and the other is the UTF-8 encoded form of the same list.
Note that some of these symbols may not be visible in your console. The horse head (equid) is expected to be invisible, as most likely no font installed on your system supports it.
Note that the question is specifically about why the output is different; it seems as though the equid character is the problem?
You can also compile and run it here using gcc-5.1: https://ideone.com/Q31D9x
#include <iostream>
using namespace std;

int main() {
    cout << "\x61\xE0\xA4\xA8\xE0\xA4\xBF\xE4\xBA\x9C\xF0\x90\x82\x83" << endl;
    cout << "\u0061\u0928\u093F\u4E9C\u10083" << endl;
    return 0;
}
Original image source: http://unicode.org/faq/char_combmark.html
Update
The corrected code is:
#include <iostream>
using namespace std;

int main() {
    cout << u8"\x61\xE0\xA4\xA8\xE0\xA4\xBF\xE4\xBA\x9C\xF0\x90\x82\x83" << endl;
    cout << u8"\u0061\u0928\u093F\u4E9C\U00010083" << endl;
    return 0;
}
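One way to confirm that the two corrected literals really are the same bytes, independent of what the console can render, is to compare them directly. The following is only a sketch: the function name `corrected_literals_match` is made up for illustration, and the `static_assert` messages are mine.

```cpp
#include <cstring>

// sizeof on a string literal counts the terminating null,
// so 15 here means 14 bytes of UTF-8.
static_assert(sizeof(u8"\x61\xE0\xA4\xA8\xE0\xA4\xBF\xE4\xBA\x9C\xF0\x90\x82\x83") == 15,
              "raw byte form should be 14 bytes plus null");
static_assert(sizeof(u8"\u0061\u0928\u093F\u4E9C\U00010083") == 15,
              "code point form should be 14 bytes plus null");

// Hypothetical helper: returns true when both literals hold identical bytes.
bool corrected_literals_match() {
    const auto& raw = u8"\x61\xE0\xA4\xA8\xE0\xA4\xBF\xE4\xBA\x9C\xF0\x90\x82\x83";
    const auto& ucn = u8"\u0061\u0928\u093F\u4E9C\U00010083";
    return sizeof(raw) == sizeof(ucn)
        && std::memcmp(raw, ucn, sizeof(raw)) == 0;
}
```

Calling `corrected_literals_match()` from `main` should return `true`. Swapping `\U00010083` back to `\u10083` makes the second `static_assert` fail, since that literal then encodes U+1008 followed by the character 3.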
Upvotes: 1
Views: 912
Reputation: 14372
The parser must parse \u10083 by assuming that \u1008 is one Unicode code point in the Basic Multilingual Plane, followed by a 3, because \u always takes exactly four hexadecimal digits. What exactly the resulting representation will be depends on the type of your string (e.g., L"", u8"", u"", U""). For a string with no such prefix, the exact representation is implementation-defined.
For code points outside of the BMP, there is the \U00010083 notation.
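The difference between the two escapes can be checked with UTF-32 literals, where each element is one code point. This is a small sketch rather than anything from the question; the names `short_escape`, `long_escape` and `escapes_parse_as_described` are invented for illustration.

```cpp
#include <string>

// \u consumes exactly four hex digits, so this is U+1008 followed by '3'.
const std::u32string short_escape = U"\u10083";

// \U consumes exactly eight hex digits and can reach beyond the BMP.
const std::u32string long_escape = U"\U00010083";

// Hypothetical helper: true if the literals decode as described above.
bool escapes_parse_as_described() {
    return short_escape.size() == 2
        && short_escape[0] == 0x1008
        && short_escape[1] == U'3'
        && long_escape.size() == 1
        && long_escape[0] == 0x10083;
}
```

With a U"" literal the element count equals the number of code points, which is why it is convenient for this kind of check.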
Upvotes: 4
Reputation: 5279
Although Felix Dombek has given an answer, I'd like to explain string literals in C++11 a little bit.
\u is not for UTF-16 or any other encoding. Escape sequences like \u and \U are encoding-agnostic. They only specify the code point, i.e. the number of the character in the big Unicode table of characters. This means that you can't tell the exact sequence of bytes which represents the string "\u5678" from the escape sequence alone.
What specifies the encoding of a string literal is a prefix, as in u"blabla". In that case the standard guarantees that the string will be encoded in UTF-16. One has to distinguish the purposes of string literal prefixes and Unicode escape sequences: the former specify the encoding and the latter specify the actual characters (which can be represented in a number of encodings).
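To illustrate, the same escape sequence yields a different number of code units depending on the prefix. The sketch below assumes a C++11 or later compiler; `code_units` is a helper name I made up, not a standard function.

```cpp
#include <cstddef>

// Hypothetical helper: number of code units in a string literal,
// excluding the terminating null.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) {
    return N - 1;
}

// One code point, U+10083, under three different prefixes.
static_assert(code_units(u8"\U00010083") == 4, "UTF-8: four bytes");
static_assert(code_units(u"\U00010083") == 2, "UTF-16: a surrogate pair");
static_assert(code_units(U"\U00010083") == 1, "UTF-32: one code unit");

// The UTF-16 form is the surrogate pair D800 DC83.
static_assert(u"\U00010083"[0] == 0xD800 && u"\U00010083"[1] == 0xDC83,
              "surrogate pair for U+10083");

// The UTF-8 form starts with the byte F0, as in the question's raw literal.
static_assert(static_cast<unsigned char>(u8"\U00010083"[0]) == 0xF0,
              "lead byte of the four-byte UTF-8 sequence");
```

The escape names the character; the prefix alone decides how many code units, and which ones, end up in the array.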
References: http://en.cppreference.com/w/cpp/language/string_literal, http://en.cppreference.com/w/cpp/language/escape
Upvotes: 4