DaedalusAlpha
DaedalusAlpha

Reputation: 1737

How does convertion between char and wchar_t work in Windows?

In Windows there are the functions like mbstowcs to convert between char and wchar_t. There are also C++ functions such as from_bytes<std::codecvt<wchar_t, char, std::mbstate_t>> to use.

But how does this work beind the scenes as char and wchar_t are obviously of different size? I assume the system codepage is involved in some way? But what happens if a wchar_t can't be correlated to a char (it can after all contain a lot more values)?

Also what happens if code that has to use char (maybe due to a library) is moved between computers with different codepages? Say that it is only using numbers (0-9) which are well within the range of ASCII, would that always be safe?

And finally, what happens on computers where the local language can't be represented in 256 characters? In that case the concept of char seems completely irrelevant other than for storing for example utf8.

Upvotes: 0

Views: 1231

Answers (3)

Johann Gerell
Johann Gerell

Reputation: 25601

I what is in the char array is actually a UTF-8 encoded string, then you can convert it to and from a UTF-16 encoded wchar_t array using

#include <locale>
#include <codecvt>
#include <string>

std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter;
std::string narrow = converter.to_bytes(wide_utf16_source_string);
std::wstring wide = converter.from_bytes(narrow_utf8_source_string);

as described in more detail at https://stackoverflow.com/a/18597384/6345

Upvotes: 0

Dietmar K&#252;hl
Dietmar K&#252;hl

Reputation: 154045

I don't know about mbstowcs() but I assume it is similar to std::codecvt<cT, bT, std::mbstate_t>. The latter travels in terms of two types:

  • A character type cT which is in your code wchar_t.
  • A byte type bT which is normally char.

The third type in play, std::mbstate_t, is used to store any intermediate state between calls to the std::codecvt<...> facet. The facets can't have any mutable state and any state between calls needs to be obtained somehow. Sadly, the structure of std::mbstate_t is left unspecified, i.e., there is no portable way to actually use it when creating own code conversion facets.

Each instance of std::codecvt<...> implements the conversions between bytes of an external encoding, e.g., UTF8, and characters. Originally, each character was meant to be a stand-alone entity but various reasons (primarily from outside the C++ community, notably from changes made to Unicode) have result in the internal characters effectively being an encoding themselves. Typically the internal encodings used are UTF8 for char and UTF16 or UCS4 for wchar_t (depending on whether wchar_t uses 16 or 32 bits).

The decoding conversions done by std::codecvt<...> take the incoming bytes in the external encoding and turn them into characters of the internal encoding. For example, when the external encoding is UTF8 the incoming bytes are converted to 32 bit code points which are then stuck into UTF16 characters by splitting them up into to wchar_t when necessary (e.g., when wchar_t is 16 bit).

The details of this process are unspecified but it will involve some bit masking and shifting. Also, different transformations will use different approaches. If the mapping between the external and internal encoding isn't as trivial as mapping one Unicode representation to another representation there may be suitable tables providing the actual mapping.

Upvotes: 0

Emilio Garavaglia
Emilio Garavaglia

Reputation: 20769

It all depends on the cvt facet used, as described here

In your case, (std::codecvt<wchar_t, char, std::mbstate_t>) it all boils down to mbsrtowcs / wcsrtombs using the global locale. (that is the "C" locale, if you don't replace it with the system one)

Upvotes: 1

Related Questions