Valentin

Reputation: 1158

Unicode: string literals and character literals

I am trying to understand how I should combine u8"" and "\uxxxx" syntax to get a UTF-8 encoded string. Can I use the latter inside of the former? Should I? How about "\x"?

I wrote this code snippet, which encodes Я (U+042F) in 4 different ways:

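#include <iostream>
#include <string>

// Note: s2 and s4 compile only before C++20; from C++20 on,
// u8 literals have type const char8_t[] and do not convert to std::string.
int main()
{
    std::string s1 = "\xD0\xAF";     // hex escapes in an ordinary literal
    std::string s2 = u8"\xD0\xAF";   // hex escapes in a UTF-8 literal
    std::string s3 = "\u042F";       // universal-character-name in an ordinary literal
    std::string s4 = u8"\u042F";     // universal-character-name in a UTF-8 literal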

    for(unsigned char c : s1)
        std::cout << std::hex << int(c) << ' ';
    std::cout << std::endl;

    for(unsigned char c : s2)
        std::cout << std::hex << int(c) << ' ';
    std::cout << std::endl;

    for(unsigned char c : s3)
        std::cout << std::hex << int(c) << ' ';
    std::cout << std::endl;

    for(unsigned char c : s4)
        std::cout << std::hex << int(c) << ' ';
    std::cout << std::endl;

    return 0;
}

The results are confusing. Both Clang and GCC produced this:

d0 af 
d0 af 
d0 af 
d0 af 

(which is great and means that I don't need to worry about it), however VS produced this:

d0 af 
c3 90 c2 af 
3f 
d0 af 

So it looks like the proper portable way of doing this is std::string s4 = u8"\u042F";. Is that correct? Is the output of my program UB, or is this a bug in VS?
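One way to pin down the s4 form is to check its bytes at compile time. The following is only a minimal sketch: the names ya and ya_utf8 are hypothetical, and the reference binding and the cast are there so that it should build both before C++20 (where u8 literals are arrays of char) and from C++20 on (where they are arrays of char8_t).

```cpp
#include <string>

// Hypothetical name: a reference to the UTF-8 literal under test.
// Binding by reference avoids naming the element type (char or char8_t).
constexpr const auto& ya = u8"\u042F";

// U+042F should occupy two UTF-8 code units, plus the terminating null.
static_assert(sizeof(ya) == 3, "expected two code units plus terminator");
static_assert(static_cast<unsigned char>(ya[0]) == 0xD0, "unexpected lead byte");
static_assert(static_cast<unsigned char>(ya[1]) == 0xAF, "unexpected continuation byte");

// Hypothetical helper: copy the literal's bytes into a std::string.
// The cast is a no-op for char and a byte-wise view for char8_t.
inline std::string ya_utf8()
{
    return std::string(reinterpret_cast<const char*>(&ya[0]), sizeof(ya) - 1);
}
```

If a compiler encoded the literal differently, the static_assert lines would fail at compile time instead of producing unexpected output at run time.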

Upvotes: 1

Views: 106

Answers (1)

Chris Dodd

Reputation: 126140

According to section 2.3 (Character sets) of the C++ spec:

Additionally, if the hexadecimal value for a universal-character-name outside the c-char-sequence, s-char-sequence, or r-char-sequence of a character or string literal corresponds to a control character (in either of the ranges 0x00–0x1F or 0x7F–0x9F, both inclusive) or to a character in the basic source character set, the program is ill-formed.

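That rule, however, covers universal-character-names outside character and string literals, and it makes the program ill-formed rather than undefined, so it does not apply to s3's initializer, where \u042F sits inside a string literal. There, translation phase 5 converts the character to the execution character set, and a character with no corresponding member is converted to an implementation-defined one. That is consistent with the 3f ('?') that VS printed. Other than that I can't see anything wrong with the code.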

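In the s2 case, VS seems to be treating each \x escape as a Unicode code point rather than a single code unit and encoding it individually in UTF-8: 0xD0 is taken as U+00D0 and becomes c3 90, and 0xAF is taken as U+00AF and becomes c2 af. I don't see anything in the spec saying that is wrong, or right.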
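To make the per-code-point encoding concrete, here is a minimal sketch of a UTF-8 encoder. The function name encode_utf8 is hypothetical (it is not a standard library function), and the code assumes its input is a valid code point, with no checks for surrogates or values above U+10FFFF.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8.
// Assumes cp is a valid scalar value; no validation is performed.
std::string encode_utf8(std::uint32_t cp)
{
    std::string out;
    if (cp < 0x80) {                      // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {              // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {            // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                              // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

With this helper, encode_utf8(0x042F) yields the bytes d0 af, while encode_utf8(0xD0) + encode_utf8(0xAF) yields c3 90 c2 af, which matches what VS printed for s2.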

Upvotes: 1
