Reputation: 1158
I am trying to understand how I should combine the u8"" and "\uxxxx" syntaxes to get a UTF-8 encoded string. Can I use the latter inside the former? Should I? And what about "\x" escapes?
I wrote this code snippet, which encodes Я (U+042F) in four different ways:
#include <iostream>
#include <string>
#include <bitset>

int main()
{
    std::string s1 = "\xD0\xAF";    // hex escapes, ordinary literal
    std::string s2 = u8"\xD0\xAF";  // hex escapes, u8 literal
    std::string s3 = "\u042F";      // universal-character-name, ordinary literal
    std::string s4 = u8"\u042F";    // universal-character-name, u8 literal

    for (unsigned char c : s1)
        std::cout << std::hex << int(c) << ' ';
    std::cout << std::endl;

    for (unsigned char c : s2)
        std::cout << std::hex << int(c) << ' ';
    std::cout << std::endl;

    for (unsigned char c : s3)
        std::cout << std::hex << int(c) << ' ';
    std::cout << std::endl;

    for (unsigned char c : s4)
        std::cout << std::hex << int(c) << ' ';
    std::cout << std::endl;

    return 0;
}
The results are confusing. Both Clang and GCC produced this:
d0 af
d0 af
d0 af
d0 af
(which is great and means I don't need to worry about it), but VS produced this:
d0 af
c3 90 c2 af
3f
d0 af
So it looks like the proper portable way of doing this is std::string s4 = u8"\u042F";. Is that correct? Is the output of my program UB, or is this a bug in VS?
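For reference, here is a minimal sketch (my own addition, not part of the original snippet) that turns that expectation into a compile-time check; the casts are there so the comparison holds whether u8 literals give char (pre-C++20) or char8_t (C++20):

// Fails to compile if u8"\u042F" does not consist of the two UTF-8
// code units 0xD0 0xAF.
static_assert(static_cast<unsigned char>(u8"\u042F"[0]) == 0xD0 &&
              static_cast<unsigned char>(u8"\u042F"[1]) == 0xAF,
              "u8 literal is not UTF-8 encoded as expected");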
Upvotes: 1
Views: 106
Reputation: 126140
According to section 2.3 (Character sets) of the C++ spec:
Additionally, if the hexadecimal value for a universal-character-name outside the c-char-sequence, s-char-sequence, or r-char-sequence of a character or string literal corresponds to a control character (in either of the ranges 0x00–0x1F or 0x7F–0x9F, both inclusive) or to a character in the basic source character set, the program is ill-formed.
which certainly applies to s3's initializer, so strictly speaking the program is ill-formed there rather than merely having undefined behavior. Other than that, I can't see anything wrong with the code.
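For what it's worth, here is a hypothetical illustration (mine, not from the standard) of a universal-character-name running into that rule; \u0041 designates the basic source character 'A':

int main()
{
    // int \u0041 = 0;   // ill-formed: \u0041 names 'A', a basic source
    //                   // character, and this UCN is outside any literal

    const char* a = "\u0041";   // inside a string literal it is allowed
                                // since C++11; this is just "A"
    return a[0] == 'A' ? 0 : 1;
}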
In the s2 case, VS seems to be treating each \x escape as a Unicode code point and encoding it individually as UTF-8 (0xD0 becomes c3 90, 0xAF becomes c2 af). I don't see anything in the spec saying whether that is wrong or right.
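A minimal sketch (my own, just to make that interpretation concrete): re-encoding the values 0xD0 and 0xAF as if each were a Unicode code point produces exactly the c3 90 c2 af bytes seen above.

#include <cstdio>

// Encode a code point below U+0800 as UTF-8: one byte below 0x80,
// otherwise two bytes.
void put_utf8(unsigned cp)
{
    if (cp < 0x80)
        std::printf("%02x ", cp);
    else
        std::printf("%02x %02x ", 0xC0 | (cp >> 6), 0x80 | (cp & 0x3F));
}

int main()
{
    put_utf8(0xD0);   // prints "c3 90"
    put_utf8(0xAF);   // prints "c2 af"
    std::printf("\n");
    return 0;
}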
Upvotes: 1