Danny

Reputation: 475

MSVC is double-encoding UTF-8 strings, why?

So, here is some simple code to recreate my issue:

#include <cstdio>

const char* badString = u8"aš𒀀"; 
const char* anotherBadString = u8"\xef\x96\xae\xef\x96\xb6\x61\xc5\xa1\xf0\x92\x80\x80";
const char* goodString = "\xef\x96\xae\xef\x96\xb6\x61\xc5\xa1\xf0\x92\x80\x80";

void printHex(const char* str)
{
    for (; *str; ++str)
    {
        printf("%02X ", *str & 0xFF);
    }
    puts("");
}

int main(int argc, char *argv[])
{
    printHex(badString);
    printHex(anotherBadString);
    printHex(goodString);

    return 0;
}

I would expect all three strings to print the same result: EF 96 AE EF 96 B6 61 C5 A1 F0 92 80 80. However, in MSVC 2019, the first two strings print C3 AF C2 96 C2 AE C3 AF C2 96 C2 B6 61 C3 85 C2 A1 C3 B0 C2 92 C2 80 C2 80. This looks like the string has been run through UTF-8 encoding an extra time.

I've read in other threads that a solution to this problem is to add the /utf-8 flag to the project, but I've tried that and it doesn't make any difference. Is there something more fundamental that I'm not understanding here?

Thanks a bunch!

Upvotes: 1

Views: 252

Answers (1)

benrg

Reputation: 1909

The first character of the first string is ï (U+00EF, Latin Small Letter I With Diaeresis), whose UTF-8 encoding is C3 AF.

You apparently want the first string to begin with U+F5AE, but whatever editor you opened the source file in agrees with MSVC that it doesn't begin with that character.

The source file is probably encoded as UTF-8 with a BOM, and that's why the /utf-8 flag doesn't change anything. The string was corrupted at some point, and now its corrupted form is faithfully represented in the file, and MSVC is faithfully preserving it in the compiled code.

The second string begins with \xef, which MSVC is interpreting as equivalent to \u00ef, which is ï again.

I can't find any clear statement in the C++20 draft standard regarding what \x is supposed to mean in UTF-8 strings (although I didn't look very hard). From experimentation, it appears that most compilers other than MSVC treat \x followed by hex digits as a literal byte, even if that makes the string not valid UTF-8.

I think you shouldn't use \x in u8-prefixed strings because it isn't portable (except for \x00 through \x7f, probably). If you want U+F5AE then write \uf5ae.

Upvotes: 1
