Danny

Reputation: 475

MSVC is double-encoding UTF-8 strings, why?

So, here is some simple code to recreate my issue:

#include <cstdio>

const char* badString = u8"aš𒀀"; 
const char* anotherBadString = u8"\xef\x96\xae\xef\x96\xb6\x61\xc5\xa1\xf0\x92\x80\x80";
const char* goodString = "\xef\x96\xae\xef\x96\xb6\x61\xc5\xa1\xf0\x92\x80\x80";

void printHex(const char* str)
{
    for (; *str; ++str)
    {
        printf("%02X ", *str & 0xFF);
    }
    puts("");
}

int main(int argc, char *argv[])
{
    printHex(badString);
    printHex(anotherBadString);
    printHex(goodString);

    return 0;
}

I would expect all three strings to print the same result: EF 96 AE EF 96 B6 61 C5 A1 F0 92 80 80. However, in MSVC 2019, the first two strings print C3 AF C2 96 C2 AE C3 AF C2 96 C2 B6 61 C3 85 C2 A1 C3 B0 C2 92 C2 80 C2 80. This looks like the string has been run through UTF-8 encoding an extra time.

I've read in other threads that a solution to this problem is to add the /utf-8 flag to the project, but I've tried that and it doesn't make any difference. Is there something more fundamental that I'm not understanding here?

Thanks a bunch!

Upvotes: 1

Views: 252

Answers (1)

benrg

Reputation: 1909

The first character of the first string is ï (U+00EF, Latin Small Letter I With Diaeresis), whose UTF-8 encoding is C3 AF.

You apparently want the first string to begin with U+F5AE, but whatever editor you opened the source file in agrees with MSVC that it doesn't begin with that character.

The source file is probably encoded as UTF-8 with a BOM, and that's why the /utf-8 flag doesn't change anything. The string was corrupted at some point, and now its corrupted form is faithfully represented in the file, and MSVC is faithfully preserving it in the compiled code.

The second string begins with \xef, which MSVC is interpreting as equivalent to \u00ef, which is ï again.

I can't find any clear statement in the C++20 draft standard regarding what \x is supposed to mean in UTF-8 strings (although I didn't look very hard). From experimentation, it appears that most compilers other than MSVC treat \x followed by hex digits as a literal byte, even if that makes the string not valid UTF-8.

I think you shouldn't use \x in u8-prefixed strings because it isn't portable (except for \x00 through \x7f, probably). If you want U+F5AE then write \uf5ae.

Upvotes: 1
