What encoding does the System::String constructor use?

Question

If I create a UTF-8 encoded char array and pass the pointer to a String like this:

char buffer[100];
CreateMyUTF8EncodedBytes(buffer, "some string with fancy characters like ö");
auto s = gcnew String(buffer);

most of it is correct, but the non-ASCII characters are replaced by gibberish. I double-checked the buffer data and it is correct. In fact, if I convert the buffer into a managed array and feed it to System::Text::Encoding::UTF8->GetString(), it returns the correct string.

It's also not ASCII: if I fill the buffer from a const char* literal, I get non-ASCII values for some characters and the result is correct.

So obviously, whatever the String constructor is trying to process, it's neither UTF-8 nor ASCII. What encoding is it using? Can I change it?

Hans Passant · Accepted Answer

You are using the String(SByte*) constructor. It assumes the bytes are encoded according to the system default code page, Encoding::Default. While that could be UTF-8, the odds of that are vanishingly small; machines don't come out of the box that way. Which code page you get depends on where you live. In Western Europe and the Americas it is code page 1252, for example. In that code page every byte is one character, so the two UTF-8 bytes for ö (0xC3 0xB6) come out as the two characters Ã¶.
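To make the gibberish concrete, here is a minimal sketch in plain C++ (not C++/CLI) that decodes the same bytes both ways. The function names decode_single_byte and decode_utf8 are made up for illustration. The single-byte decoder only models the part of code page 1252 where the code point equals the byte value, and the UTF-8 decoder does no validation.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Decode as a single-byte code page: one byte becomes one character.
// Assumption: only bytes 0x00-0x7F and 0xA0-0xFF are modelled, where
// Windows-1252 matches ISO-8859-1 (code point == byte value).
// Bytes 0x80-0x9F map elsewhere in 1252 and are not handled here.
std::vector<char32_t> decode_single_byte(const std::string& bytes) {
    std::vector<char32_t> out;
    for (unsigned char b : bytes) {
        out.push_back(b);
    }
    return out;
}

// Minimal UTF-8 decoder: the lead byte tells how many continuation
// bytes follow, each continuation byte contributes its low 6 bits.
// No validation of malformed input; this is a sketch only.
std::vector<char32_t> decode_utf8(const std::string& bytes) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char b = static_cast<unsigned char>(bytes[i]);
        char32_t cp;
        int extra;
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else                         { cp = b & 0x07; extra = 3; }
        ++i;
        for (; extra > 0 && i < bytes.size(); --extra, ++i) {
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i]) & 0x3F);
        }
        out.push_back(cp);
    }
    return out;
}
```

Feeding "\xC3\xB6" to decode_single_byte yields two code points, U+00C3 and U+00B6, while decode_utf8 yields the single code point U+00F6 (ö). For pure ASCII input both decoders agree, which is why most of the string looks fine.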

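Yes, you must use Encoding::UTF8 if you know that the buffer contains UTF-8 encoded bytes. You can do that with Encoding::UTF8->GetString() on a managed byte array, as you already did, or with the String(SByte*, Int32, Int32, Encoding) constructor overload, which takes the encoding explicitly.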

Do beware that you still don't know much about the encoding of the string literal you pass to your CreateMyUTF8EncodedBytes() function. That depends on the encoding your text editor used to save the file and the encoding the compiler guessed when reading it. Saving the source as UTF-8 with a BOM is best: UTF-8 so your program still compiles correctly when your source file travels a thousand miles, and a BOM so the compiler doesn't have to guess. If you use VS then that's controlled by File > Save As > arrow on the Save button > Save with Encoding > select "Unicode (UTF-8 with signature)". Note that this only settles how the source is read. The bytes of a narrow literal also depend on the compiler's execution character set, which MSVC defaults to the system code page. Only when that is UTF-8 as well (the /utf-8 or /execution-charset:utf-8 options in newer MSVC versions) does CreateMyUTF8EncodedBytes() become a no-op function :)
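If you want to check what the compiler actually put into a literal, dumping the bytes is a quick test. Below is a small sketch in plain C++. The names hex_dump and kOUmlautUtf8 are hypothetical and not part of the question's code. Spelling the character with hex escapes keeps the bytes independent of how the source file was saved.

```cpp
#include <cstdio>
#include <string>

// Render every byte of the buffer as two uppercase hex digits,
// separated by single spaces, e.g. "C3 B6".
std::string hex_dump(const std::string& bytes) {
    std::string out;
    char buf[4];
    for (unsigned char b : bytes) {
        if (!out.empty()) {
            out += ' ';
        }
        std::snprintf(buf, sizeof buf, "%02X", static_cast<unsigned>(b));
        out += buf;
    }
    return out;
}

// Hypothetical constant: the UTF-8 encoding of ö written with hex
// escapes, so the source file encoding cannot change these bytes.
const std::string kOUmlautUtf8 = "\xC3\xB6";
```

hex_dump(kOUmlautUtf8) gives "C3 B6". If a literal typed as a plain ö dumps as "F6" instead, the compiler stored it in a single-byte code page such as 1252 rather than in UTF-8.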
