Reputation: 1236
Here I have some simple code:
#include <iostream>
#include <cstdint>
int main()
{
    const unsigned char utf8_string[] = u8"\xA0";
    std::cout << std::hex << "Size: " << sizeof(utf8_string) << std::endl;
    for (int i = 0; i < sizeof(utf8_string); i++) {
        std::cout << std::hex << (uint16_t)utf8_string[i] << std::endl;
    }
}
I see different behavior here with MSVC and GCC.
MSVC sees "\xA0"
as an unencoded Unicode code point, and encodes it to UTF-8.
So in MSVC the output is:
C2A0
which is the correct UTF-8 encoding of the Unicode code point U+00A0.
But with GCC nothing happens: it treats the string as plain bytes. The output is the same even if I remove the u8
prefix from the literal.
Both compilers encode to UTF-8 and output C2A0
if the string is set to u8"\u00A0";
Why do compilers behave differently and which actually does it right?
Software used for test:
GCC 8.3.0
MSVC 19.00.23506
C++ 11
Upvotes: 10
Views: 4356
Reputation: 76688
This is CWG issue 1656.
It has been resolved in the current standard draft through P2029R4: a numeric escape sequence specifies the value of a single code unit, not a Unicode code point that is then encoded to UTF-8. This holds even if it results in an invalid UTF-8 sequence.
Therefore GCC's behavior is/will be correct.
Upvotes: 2
Reputation: 308206
I can't tell you which way is true to the standard.
The way MSVC does it is at least logically consistent and easily explainable. The three escape sequences \x, \u, and \U behave identically except for the hex digits they pull from the input (\u takes exactly 4, \U exactly 8, and \x as many as follow). Each defines a Unicode code point that must then be encoded to UTF-8. Embedding a byte without encoding it opens the possibility of creating an invalid UTF-8 sequence.
Upvotes: 1
Reputation: 21416
Why do compilers behave differently and which actually does it right?
Compilers behave differently because of the way they chose to implement the C++ standard.
Things that fail in GCC will often work in MSVC because it is more permissive, and MSVC handles some of these issues automatically.
Here is a similar example: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=33167. It follows the standard, but it's not what you would expect.
As to which does it right, depends on what your definition of "right" is.
Upvotes: 0
Reputation: 7184
They're both wrong.
As far as I can tell, the C++17 standard says here that:
The size of a narrow string literal is the total number of escape sequences and other characters, plus at least one for the multibyte encoding of each universal-character-name, plus one for the terminating '\0'.
Although there are other hints, this seems to be the strongest indication that escape sequences are not multi-byte and that MSVC's behaviour is wrong.
There are tickets filed for this which are currently marked as Under Investigation.
However it also says here about UTF-8 literals that:
If the value is not representable with a single UTF-8 code unit, the program is ill-formed.
Since 0xA0
is not representable as a single UTF-8 code unit, the program should not compile.
Note that:

- u8 string literals are defined as being narrow.
- \xA0 is an escape sequence.
- \u00A0 is considered a universal-character-name and not an escape sequence.

Upvotes: 3