Kristian Spangsege

C++14: Conversion between UTF-8/UTF-16 and native character encoding

I have 4 closely related questions:

  1. Does C++14 have a built-in mechanism for converting between UTF-8 and the system's native multibyte encoding, i.e., the multibyte encoding assumed by the std::codecvt<wchar_t, char> specialization (http://en.cppreference.com/w/cpp/locale/codecvt)?

  2. Does C++14 have a built-in mechanism for converting between UTF-8 and the system's native wide character encoding, i.e., the wide character encoding assumed by the std::codecvt<wchar_t, char> specialization (http://en.cppreference.com/w/cpp/locale/codecvt)?

  3. Same as question 1, but for UTF-16 instead of UTF-8.

  4. Same as question 2, but for UTF-16 instead of UTF-8.

EDIT: I realize that a "yes" to any of these questions effectively means "yes" to all 4, because C++14 clearly does provide ways of converting between UTF-8 and UTF-16 (std::codecvt<char16_t, char, std::mbstate_t>), as well as between the native multibyte and native wide character encodings (std::codecvt<wchar_t, char>).
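
For illustration, the UTF-8/UTF-16 conversion mentioned in the edit can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: the helper names utf8_to_utf16 and utf16_to_utf8 and the deletable wrapper are hypothetical names chosen for this example, and it assumes a standard library whose std::wstring_convert accepts this facet (these facilities were deprecated in later standards, so newer compilers may warn).

```cpp
#include <cassert>
#include <codecvt>
#include <cwchar>
#include <locale>
#include <string>

// std::codecvt has a protected destructor; this wrapper (hypothetical name)
// makes the facet destructible so std::wstring_convert can own and delete it.
template<class F> struct deletable : F { using F::F; ~deletable() {} };

using utf8_utf16_facet = deletable<std::codecvt<char16_t, char, std::mbstate_t>>;

// Hypothetical helper: UTF-8 bytes -> UTF-16 code units.
std::u16string utf8_to_utf16(const std::string& s)
{
    return std::wstring_convert<utf8_utf16_facet, char16_t>{}.from_bytes(s);
}

// Hypothetical helper: UTF-16 code units -> UTF-8 bytes.
std::string utf16_to_utf8(const std::u16string& s)
{
    return std::wstring_convert<utf8_utf16_facet, char16_t>{}.to_bytes(s);
}
```

Both helpers throw std::range_error on malformed input, since no error string is passed to std::wstring_convert.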


Answers (1)

Cubbi

"the system's native multibyte encoding, i.e., the multibyte encoding assumed by the std::codecvt<wchar_t, char> specialization"

There is some confusion here, possibly due to misleading wording on cppreference (my fault; now fixed to match the standard and reality). In the existing implementations (libc++ and libstdc++), the locale-independent specialization std::codecvt<wchar_t, char, std::mbstate_t> does not deal with any multibyte encodings. The standard's wording is "native character sets for narrow and wide characters", and the existing implementations took that to mean 1:1 conversions only, like what btowc/wctob do in C:

// std::codecvt has a protected destructor; this wrapper makes it destructible
template<class F> struct facet : F { using F::F; ~facet() {} };
facet<std::codecvt<wchar_t, char, std::mbstate_t>> fp;
std::cout << fp.max_length() << '\n'; // prints 1 in libc++ and libstdc++

In fact, the libc++ implementation does exactly this.

In every useful context, a multibyte encoding is either one specified by a locale-provided codecvt facet, one specified by a custom codecvt facet, or UTF-8 (provided by the std::codecvt_utf8* facets). Meaning,

"between UTF-8 and the system's native multibyte encoding"

means "between UTF-8 and a multibyte encoding specified by a locale":

#include <cassert>
#include <codecvt>
#include <locale>
#include <string>

template<class F> struct myFacet : F { using F::F; ~myFacet() {} };
int main()
{
    std::string in = u8"水"; // UTF-8
    // utf8 to wide (could've used en_US.utf8, but this one exists as-is)
    std::wstring ws = std::wstring_convert<std::codecvt_utf8<wchar_t>>{}.from_bytes(in);
    assert(ws == L"水");
    // wide to another mb (have to use a named locale now; it must be installed on the system)
    typedef myFacet<std::codecvt_byname<wchar_t, char, std::mbstate_t>> F;
    std::string out = std::wstring_convert<F>{ new F("zh_CN.gb18030") }.to_bytes(ws);
    assert(out == "\xcb\xae");
}

"Does C++14 have a built-in mechanism for converting between UTF-8 and the system's native wide character encoding"

The native wide encoding is effectively defined to be Unicode or (as on Windows) some subset of it, and that's what you get from std::codecvt_utf8. A hostile implementation could conceivably have wchar_t hold values numerically different from the Unicode code points, as long as they map 1:1, but given that they must equal the code points for the basic character set, that is unrealistic.
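
A minimal sketch of that claim, assuming an implementation where wchar_t holds Unicode code points (as on the common platforms); the helper names utf8_to_wide and wide_to_utf8 are hypothetical and chosen only for this example. The character U+6C34 lies in the Basic Multilingual Plane, so the result is a single wchar_t whether wchar_t is 16 or 32 bits wide:

```cpp
#include <cassert>
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper: UTF-8 bytes -> native wide string.
std::wstring utf8_to_wide(const std::string& s)
{
    return std::wstring_convert<std::codecvt_utf8<wchar_t>>{}.from_bytes(s);
}

// Hypothetical helper: native wide string -> UTF-8 bytes.
std::string wide_to_utf8(const std::wstring& ws)
{
    return std::wstring_convert<std::codecvt_utf8<wchar_t>>{}.to_bytes(ws);
}
```

For example, utf8_to_wide("\xe6\xb0\xb4") is expected to yield one wchar_t whose numeric value is the code point 0x6C34, and wide_to_utf8 maps it back to the same three bytes.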
