toozyfuzzy

Reputation: 1236

C++ utf-8 literals in GCC and MSVC

Here I have some simple code:

    #include <iostream>
    #include <cstdint>

    int main()
    {
        const unsigned char utf8_string[] = u8"\xA0";
        std::cout << std::hex << "Size: " << sizeof(utf8_string) << std::endl;
        for (int i = 0; i < sizeof(utf8_string); i++) {
            std::cout << std::hex << (uint16_t)utf8_string[i] << std::endl;
        }
    }

I see different behavior here between MSVC and GCC. MSVC treats "\xA0" as a not-yet-encoded Unicode code point and converts it to UTF-8, so with MSVC the output is:

C2A0

which is the correct UTF-8 encoding of the Unicode code point U+00A0.

But with GCC nothing happens: it treats the string as plain bytes. The output does not change even if I remove the u8 prefix from the literal.

Both compilers encode to UTF-8 and output C2A0 if the string is set to u8"\u00A0".
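That is, the same test with only the literal changed:

    #include <iostream>
    #include <cstdint>

    int main()
    {
        // Same test, but with a universal-character-name instead of a hex escape.
        const unsigned char utf8_string[] = u8"\u00A0";
        std::cout << std::hex << "Size: " << sizeof(utf8_string) << std::endl;      // 3 with both compilers
        for (int i = 0; i < sizeof(utf8_string); i++) {
            std::cout << std::hex << (uint16_t)utf8_string[i] << std::endl;         // c2, a0, 0
        }
    }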

Why do compilers behave differently and which actually does it right?

Software used for the test:

GCC 8.3.0

MSVC 19.00.23506

C++ 11

Upvotes: 10

Views: 4356

Answers (4)

user17732522

Reputation: 76688

This is CWG issue 1656.

It has been resolved in the current standard draft through P2029R4: numeric escape sequences specify the value of a single code unit, not a Unicode code point that is then encoded to UTF-8, even if the result is an invalid UTF-8 sequence.

Therefore GCC's behavior is/will be correct.
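A minimal sketch of what the resolved wording implies (the expected values below follow the resolution; a compiler that has not implemented it yet, such as the MSVC version in the question, will still report 3 for the first literal):

    #include <cstddef>
    #include <iostream>

    int main()
    {
        // Under the resolution, \xA0 specifies a single code unit with value 0xA0,
        // even though a lone 0xA0 byte is not a valid UTF-8 sequence.
        constexpr std::size_t escaped = sizeof(u8"\xA0");  // expected: 2 (0xA0 + '\0')

        // A universal-character-name still names the code point U+00A0, which is
        // encoded as the two UTF-8 code units 0xC2 0xA0.
        constexpr std::size_t named = sizeof(u8"\u00A0");  // expected: 3 (0xC2, 0xA0, '\0')

        std::cout << escaped << ' ' << named << '\n';      // expected: 2 3
    }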

Upvotes: 2

Mark Ransom

Reputation: 308206

I can't tell you which way is true to the standard.

The way MSVC does it is at least logically consistent and easily explainable: the three escape sequences \x, \u, and \U behave identically except for the number of hex digits they pull from the input (2, 4, or 8), and each defines a Unicode code point that must then be encoded to UTF-8. Embedding a raw byte without encoding it opens up the possibility of creating an invalid UTF-8 sequence.
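A sketch of what that model predicts; the byte comments reflect MSVC's interpretation as described above (GCC, which keeps \xA0 as a raw byte, prints a0 0 for the first array instead). Compiled as C++17 or earlier, matching the question:

    #include <iostream>

    int main()
    {
        // Under MSVC's model all three spellings name the code point U+00A0,
        // differing only in how many hex digits they consume.
        const unsigned char two[]   = u8"\xA0";        // 2 hex digits
        const unsigned char four[]  = u8"\u00A0";      // 4 hex digits
        const unsigned char eight[] = u8"\U000000A0";  // 8 hex digits

        std::cout << std::hex
                  << (unsigned)two[0]   << ' ' << (unsigned)two[1]   << '\n'  // c2 a0 (MSVC)
                  << (unsigned)four[0]  << ' ' << (unsigned)four[1]  << '\n'  // c2 a0
                  << (unsigned)eight[0] << ' ' << (unsigned)eight[1] << '\n'; // c2 a0
    }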

Upvotes: 1

Cosmin

Reputation: 21416

Why do compilers behave differently and which actually does it right?

Compilers behave differently because of the way they decided to implement the C++ standard:

  • GCC uses strict rules and implements the standard as written
  • MSVC uses looser rules and implements the standard in a more practical, "real-world" way

So things that fail in GCC will usually work in MSVC because it is more permissive, and MSVC handles some of these issues automatically.

Here is a similar example: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=33167. It follows the standard, but it's not what you would expect.

As to which one does it right: that depends on what your definition of "right" is.

Upvotes: 0

Etienne Laurin

Reputation: 7184

They're both wrong.

As far as I can tell, the C++17 standard says here that:

The size of a narrow string literal is the total number of escape sequences and other characters, plus at least one for the multibyte encoding of each universal-character-name, plus one for the terminating '\0'.

Although there are other hints, this seems to be the strongest indication that escape sequences are not multi-byte and that MSVC's behaviour is wrong.

There are tickets for this which are currently marked as Under Investigation.

However it also says here about UTF-8 literals that:

If the value is not representable with a single UTF-8 code unit, the program is ill-formed.

Since 0xA0 is not representable as a single UTF-8 code unit, the program should not compile.

Note that:

  • UTF-8 literals starting with u8 are defined as being narrow string literals.
  • \xA0 is an escape sequence.
  • \u00A0 is considered a universal-character-name, not an escape sequence (see the sketch after this list).
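A sketch applying the quoted size rule and this distinction; the 2 is what GCC produces, while MSVC currently reports 3 for the first literal, which is the behaviour argued above to be wrong:

    #include <iostream>

    int main()
    {
        // \xA0 is one escape sequence: per the quoted size rule it contributes
        // exactly one element, so the size should be 1 + 1 (for '\0') = 2.
        const unsigned char escape[] = u8"\xA0";

        // \u00A0 is a universal-character-name: it contributes its multibyte UTF-8
        // encoding (0xC2 0xA0), so the size should be 2 + 1 (for '\0') = 3.
        const unsigned char ucn[] = u8"\u00A0";

        std::cout << sizeof(escape) << ' ' << sizeof(ucn) << '\n';  // 2 3 per the rule
    }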

Upvotes: 3
