Reputation: 3249
I want to convert a string encoded in a double-byte code page into a UTF-16 string using std::codecvt<wchar_t, char, std::mbstate_t>::in() on the Microsoft standard library implementation (MSVC11). For example, consider the following program:
#include <iostream>
#include <locale>

int main()
{
    // KATAKANA LETTER A (U+30A2) in Shift-JIS (Codepage 932)
    // http://msdn.microsoft.com/en-us/goglobal/cc305152
    char const cs[] = "\x83\x41";

    std::locale loc = std::locale("Japanese");
    // Output: "Japanese_Japan.932" (as expected)
    std::cout << loc.name() << '\n';

    typedef std::codecvt<wchar_t, char, std::mbstate_t> cvt_t;
    cvt_t const& codecvt = std::use_facet<cvt_t>(loc);

    wchar_t out = 0;
    std::mbstate_t mbst = std::mbstate_t();
    char const* mid;
    wchar_t* outmid;

    // Output: "2" (error) (expected: "0" (ok))
    std::cout << codecvt.in(
        mbst, cs, cs + 2, mid,
        &out, &out + 1, outmid) << '\n';

    // Output: "0" (expected: "30a2")
    std::cout << std::hex << out << '\n';
}
When debugging, I found that in() ends up calling the internal _Mbrtowc() function (crt\src\xmbtowc.c), passing the internal (C?) part of the std::locale, initialized with {_Page=932 _Mbcurmax=2 _Isclocale=0 ...}, where ... stands for (and this seems to be the problem) the _Isleadbyte member, initialized to an array of 32 zeros (of type unsigned char). Thus, when the function processes the '\x83' lead byte, it consults this array and naturally comes to the (wrong) conclusion that this is not a lead byte. So it happily calls the MultiByteToWideChar() Win32 API function, which, of course, fails to convert the halved character. _Mbrtowc() therefore returns the error code -1, which more or less cancels everything up the call stack, and ultimately 2 (std::codecvt_base::result::error) is returned.
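To illustrate why an all-zero lead-byte table breaks the conversion, here is a minimal sketch of a 32-byte (256-bit) lead-byte bitmap. The names LeadByteTable, is_lead_byte, mark_range, make_cp932_table and char_length are hypothetical, and the bit layout (bit b & 7 of element b >> 3) is an assumption; I have not confirmed that _Isleadbyte is indexed this way. The lead-byte ranges 0x81-0x9F and 0xE0-0xFC are those of code page 932.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical stand-in for a 32-byte lead-byte table:
// 32 * 8 = 256 bits, one bit per possible byte value.
struct LeadByteTable {
    unsigned char bits[32];
};

// Assumed layout: bit (b & 7) of bits[b >> 3] is set when b is a lead byte.
inline bool is_lead_byte(const LeadByteTable& t, unsigned char b) {
    return (t.bits[b >> 3] & (1u << (b & 7))) != 0;
}

// Mark every byte value in [first, last] as a lead byte.
inline void mark_range(LeadByteTable& t, unsigned first, unsigned last) {
    assert(first <= last && last <= 0xFFu);
    for (unsigned b = first; b <= last; ++b)
        t.bits[b >> 3] |= static_cast<unsigned char>(1u << (b & 7));
}

// Build a table for code page 932 (lead bytes 0x81-0x9F and 0xE0-0xFC).
inline LeadByteTable make_cp932_table() {
    LeadByteTable t;
    std::memset(t.bits, 0, sizeof t.bits);
    mark_range(t, 0x81, 0x9F);
    mark_range(t, 0xE0, 0xFC);
    return t;
}

// Number of bytes a decoder would pass on for the character starting at s:
// 2 if the first byte is a lead byte, otherwise 1.
inline int char_length(const LeadByteTable& t, const char* s) {
    return is_lead_byte(t, static_cast<unsigned char>(*s)) ? 2 : 1;
}
```

With a zero-filled table, char_length reports 1 for "\x83\x41", so only the lone 0x83 byte would be handed to the converter. With the table from make_cp932_table it reports 2, which is what the conversion needs.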
Is this a bug in the MS standard library (it seems so)? (How) can I work around this in a portable way (i.e. with the fewest #ifdefs)?
Upvotes: 2
Views: 699
Reputation: 3684
I reported it internally to Microsoft. They have now filed it as a new bug (DevDiv#737880). But I recommend filing a Connect item at: http://connect.microsoft.com/VisualStudio
Upvotes: 1
Reputation: 447
I copied and pasted your code into VC2010 on Windows 7 64-bit.
It works as you expect. Here's the output:
Japanese_Japan.932
0
30a2
It's probably a bug introduced with VC2012...
Upvotes: 1