Reputation: 1008
Why do some UTF-16 encoded wide strings, when converted to UTF-8 encoded narrow strings with this commonly found conversion function, produce byte values that don't appear to be correct?
#include <codecvt>
#include <locale>
#include <string>

std::string convert_string(const std::wstring& str)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    return conv.to_bytes(str);
}
Hello. I have a C++ app on Windows which takes some user input on the command line. I'm using the wide-character main entry point to get the input as a UTF-16 string, which I'm converting to a UTF-8 narrow string using the above function.
This function can be found in many places online and works in almost all cases. However, I have found a few examples where it doesn't work as expected.
For example, if I take the emoji character "🤢" as a string literal (in my UTF-8 encoded .cpp file) and write it to disk, the file (FILE-1) contains the following data, which are the correct UTF-8 hex values specified here: https://www.fileformat.info/info/unicode/char/1f922/index.htm
0xF0 0x9F 0xA4 0xA2
However, if I pass the emoji to my application on the command line, convert it to a UTF-8 string using the conversion function above and then write it to disk, the file (FILE-2) contains different raw bytes:
0xED 0xA0 0xBE 0xED 0xB4 0xA2
While the second file seems to indicate the conversion has produced the wrong output, if you copy and paste the hex values (in Notepad++ at least) it produces the correct emoji. WinMerge also considers the two files to be identical.
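For reference, this is roughly how the two files are produced (a minimal sketch, not my exact code; the file names, the use of wmain and the binary-mode std::ofstream calls are assumptions):

#include <fstream>
#include <string>

// convert_string is the codecvt-based function shown above

int wmain(int argc, wchar_t* argv[])
{
    // FILE-1: the emoji as a narrow string literal from the UTF-8 encoded .cpp file
    std::ofstream f1("file1.bin", std::ios::binary);
    f1 << "🤢";  // bytes as stored in the source: F0 9F A4 A2

    // FILE-2: the emoji received on the command line as UTF-16 and converted
    if (argc > 1)
    {
        std::ofstream f2("file2.bin", std::ios::binary);
        f2 << convert_string(argv[1]);  // bytes: ED A0 BE ED B4 A2
    }
    return 0;
}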
So, to conclude, I would really like to know why the conversion function above produces these different bytes, and which of the two outputs is actually correct UTF-8.
I should note that I already have a workaround function (below) which uses WinAPI calls; however, using only standard library calls is the dream :)
std::string convert_string(const std::wstring& wstr)
{
    if (wstr.empty())
        return std::string();

    // Query the required buffer size first, then perform the actual conversion.
    int size_needed = WideCharToMultiByte(CP_UTF8, 0, &wstr[0], (int)wstr.size(), NULL, 0, NULL, NULL);
    std::string strTo(size_needed, 0);
    WideCharToMultiByte(CP_UTF8, 0, &wstr[0], (int)wstr.size(), &strTo[0], size_needed, NULL, NULL);
    return strTo;
}
Upvotes: 1
Views: 926
Reputation: 341
The problem is that std::wstring_convert<std::codecvt_utf8<wchar_t>> converts from UCS-2, not from UTF-16. Characters inside the BMP (U+0000..U+FFFF) have identical encodings in UCS-2 and UTF-16 and so will work, but characters outside the BMP (U+10000..U+10FFFF), such as your emoji, do not exist in UCS-2 at all. This means the conversion doesn't understand the character and produces incorrect UTF-8 bytes: technically, it has converted each half of the UTF-16 surrogate pair into a separate UTF-8 character.
You need to use std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> instead.
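For illustration, here is the question's function with that one change applied (a minimal sketch of the fix this answer describes; on Windows wchar_t is 16-bit, so the facet treats the input as UTF-16):

#include <codecvt>
#include <locale>
#include <string>

std::string convert_string(const std::wstring& str)
{
    // codecvt_utf8_utf16 interprets the wchar_t input as UTF-16, so surrogate
    // pairs are combined into one code point before being encoded as UTF-8.
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> conv;
    return conv.to_bytes(str);
}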
Upvotes: 7
Reputation: 73627
There is already an accepted answer here, but for the record, here is some additional information.
The nauseated face emoji was introduced in Unicode in 2016. It is encoded as 4 UTF-8 bytes (0xF0 0x9F 0xA4 0xA2) or as 2 UTF-16 code units (0xD83E 0xDD22).
The surprising byte sequence 0xED 0xA0 0xBE 0xED 0xB4 0xA2 corresponds in fact to a UTF-16 surrogate pair, with each surrogate encoded as if it were a character of its own: 0xED 0xA0 0xBE is the UTF-8 encoding of the high surrogate 0xD83E, and 0xED 0xB4 0xA2 is the UTF-8 encoding of the low surrogate 0xDD22 (see this conversion table). So basically, your first file contains the direct UTF-8 encoding of the character, whereas the second contains the UTF-8 encoding of the two UCS-2 code units that make up the character's UTF-16 encoding.
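To make the arithmetic concrete, here is a small standalone sketch (my own illustration, not part of either answer) that derives both byte sequences from the code point U+1F922 using the standard surrogate-pair and UTF-8 formulas:

#include <cstdint>
#include <cstdio>

// Encode a 16-bit value (here a lone surrogate) as a 3-byte UTF-8 sequence.
static void put_utf8_3(uint16_t u)
{
    std::printf("%02X %02X %02X  ", 0xE0 | (u >> 12), 0x80 | ((u >> 6) & 0x3F), 0x80 | (u & 0x3F));
}

int main()
{
    const uint32_t cp = 0x1F922;  // U+1F922 NAUSEATED FACE

    // Direct 4-byte UTF-8 encoding: F0 9F A4 A2 (what FILE-1 contains)
    std::printf("%02X %02X %02X %02X\n",
                0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
                0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F));

    // UTF-16 surrogate pair: D83E DD22
    const uint16_t hi = 0xD800 + ((cp - 0x10000) >> 10);
    const uint16_t lo = 0xDC00 + ((cp - 0x10000) & 0x3FF);

    // Encoding each surrogate separately as 3-byte UTF-8 yields the bytes
    // found in FILE-2: ED A0 BE ED B4 A2
    put_utf8_3(hi);
    put_utf8_3(lo);
    std::printf("\n");
    return 0;
}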
As the accepted answer rightly pointed out, std::codecvt_utf8<wchar_t> is the culprit, because it handles UCS-2 and not UTF-16.
It's quite astonishing to still find this obsolete encoding in standard libraries nowadays, but I suspect it is a relic of Microsoft's lobbying in the standard committee, dating back to the old Windows support for Unicode via UCS-2.
Upvotes: 2