aatwo

Reputation: 1008

Issue when converting utf16 wide std::wstring to utf8 narrow std::string for rare characters

Why do some utf16 encoded wide strings, when converted to utf8 encoded narrow strings using this commonly found conversion function, produce hex values that don't appear to be correct?

#include <codecvt>
#include <locale>
#include <string>

std::string convert_string(const std::wstring& str)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    return conv.to_bytes(str);
}

Hello. I have a C++ app on Windows which takes some user input on the command line. I'm using the wide character main entry point to get the input as a utf16 string which I'm converting to a utf8 narrow string using the above function.

This function can be found in many places online and works in almost all cases. I have however found a few examples where it doesn't seem to work as expected.

For example, if I input an emoji character "🤢" as a string literal (in my utf8 encoded cpp file) and write it to disk, the file (FILE-1) contains the following data (which are the correct utf8 hex values specified here https://www.fileformat.info/info/unicode/char/1f922/index.htm):

    0xF0 0x9F 0xA4 0xA2

However, if I pass the emoji to my application on the command line, convert it to a utf8 string using the conversion function above, and then write it to disk, the file (FILE-2) contains different raw bytes:

    0xED 0xA0 0xBE 0xED 0xB4 0xA2

While the second file seems to indicate the conversion has produced the wrong output, if you copy and paste the hex values (in Notepad++ at least) they produce the correct emoji. Also, WinMerge considers the two files to be identical.

So to conclude, I would really like to know the following:

  1. How the incorrect-looking converted hex values map correctly to the right utf8 character in the example above
  2. Why the conversion function converts some characters to this form while almost all other characters produce the expected raw bytes
  3. As a bonus, whether it is possible to modify the conversion function to stop it from outputting these rare characters in this form

I should note that I already have a workaround function below which uses WinAPI calls; however, using only standard library calls is the dream :)

std::string convert_string(const std::wstring& wstr)
{
    if(wstr.empty())
        return std::string();

    // First call computes the required buffer size; second call performs the actual conversion.
    int size_needed = WideCharToMultiByte(CP_UTF8, 0, &wstr[0], (int)wstr.size(), NULL, 0, NULL, NULL);
    std::string strTo(size_needed, 0);
    WideCharToMultiByte(CP_UTF8, 0, &wstr[0], (int)wstr.size(), &strTo[0], size_needed, NULL, NULL);
    return strTo;
}
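
For reference, a minimal sketch of the kind of harness that produces the files above: a wide-character entry point that converts the first command-line argument and writes the raw bytes to disk (the argument handling and output file name here are just illustrative):

#include <fstream>
#include <string>

std::string convert_string(const std::wstring& wstr); // either version above

int wmain(int argc, wchar_t* argv[])
{
    if(argc < 2)
        return 1;

    // Convert the UTF-16 argument to UTF-8 and dump the raw bytes for inspection.
    const std::string utf8 = convert_string(argv[1]);

    std::ofstream out("output.bin", std::ios::binary);
    out.write(utf8.data(), static_cast<std::streamsize>(utf8.size()));
    return 0;
}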

Upvotes: 1

Views: 926

Answers (2)

Stu

Reputation: 341

The problem is that std::wstring_convert<std::codecvt_utf8<wchar_t>> converts from UCS-2, not from UTF-16. Characters inside the BMP (U+0000..U+FFFF) have identical encodings in both UCS-2 and UTF-16 and so will work, but characters outside the BMP (U+10000..U+10FFFF), such as your emoji, do not exist in UCS-2 at all. This means the conversion doesn't understand the character and produces incorrect UTF-8 bytes (technically, it converts each half of the UTF-16 surrogate pair into a separate UTF-8 character).

You need to use std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> instead.
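
A minimal sketch of the corrected function, keeping the same signature as in the question and assuming wchar_t holds UTF-16 code units, as it does on Windows:

#include <codecvt>
#include <locale>
#include <string>

// codecvt_utf8_utf16 treats the wide string as UTF-16, so a surrogate pair
// such as the one for U+1F922 is combined into a single code point and
// encoded as the expected four UTF-8 bytes.
std::string convert_string(const std::wstring& str)
{
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> conv;
    return conv.to_bytes(str);
}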

Upvotes: 7

Christophe

Reputation: 73627

There is already a validated answer here, but for the record, here is some additional information.

The nauseated face emoji was introduced in Unicode in 2016. Its encoding is 4 utf-8 bytes (0xF0 0x9F 0xA4 0xA2) or 2 utf-16 code units (0xD83E 0xDD22).

The surprising encoding of 0xED 0xA0 0xBE 0xED 0xB4 0xA2 corresponds in fact to that surrogate pair, with each surrogate encoded as utf8 on its own:

  • 0xED 0xA0 0xBE is the utf8 encoding of the high surrogate 0xD83E according to this conversion table.
  • 0xED 0xB4 0xA2 corresponds to the utf8 encoding of the low surrogate 0xDD22 according to this table.

So basically, your first encoding is the direct utf8 encoding of the character. The second encoding is the utf8 encoding of the two utf-16 code units taken individually, as if each were a standalone UCS-2 character.
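
A small sketch reproducing that arithmetic for U+1F922, using the standard surrogate-pair split and the 3-byte utf8 pattern (the helper name is just for illustration):

#include <cstdint>
#include <cstdio>

// Encode a single 16-bit value as if it were a standalone code point
// (anything in the range 0x0800..0xFFFF takes three utf8 bytes).
static void encode_as_utf8(std::uint16_t unit)
{
    const unsigned b1 = 0xE0u | (unit >> 12);
    const unsigned b2 = 0x80u | ((unit >> 6) & 0x3Fu);
    const unsigned b3 = 0x80u | (unit & 0x3Fu);
    std::printf("0x%02X 0x%02X 0x%02X\n", b1, b2, b3);
}

int main()
{
    const std::uint32_t cp = 0x1F922;                  // nauseated face emoji
    const std::uint32_t v  = cp - 0x10000;             // 20-bit value to split
    const std::uint16_t high = 0xD800 + (v >> 10);     // 0xD83E
    const std::uint16_t low  = 0xDC00 + (v & 0x3FF);   // 0xDD22

    // Encoding each surrogate separately yields exactly the six bytes
    // observed in the second file: 0xED 0xA0 0xBE and 0xED 0xB4 0xA2.
    encode_as_utf8(high);
    encode_as_utf8(low);
    return 0;
}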

As the accepted answer rightly pointed out, the std::codecvt_utf8<wchar_t> is the culprit, because it's about UCS-2 and not UTF-16.

It's quite astonishing to still find this obsolete encoding in the standard library nowadays, but I suspect this is a leftover of Microsoft's lobbying in the standard committee, dating back to the old Windows support for Unicode with UCS-2.

Upvotes: 2
