Reputation: 2240
I have a socket through which I send serialized std::wstring objects, so for example, say, from a Chinese version of Windows to a Unix system running with an Arabic UI.
I can't understand how my Unix system (or anything else) will know that these std::wstring objects contain Chinese text. I don't see any "code page" (i.e. country/language?) stored in std::wstring, so do I have to associate a code page with each std::wstring myself?
I must be missing something, as this seems like a basic question in our interconnected world ...
Thanks,
Upvotes: 1
Views: 222
Reputation: 231303
Generally speaking, wstring is encoded in a Unicode encoding, which is language-neutral. No matter what language settings you're using on your own computer, the content of the wstring is the same (this is one of the main advantages of Unicode!).
However, note that there is more than one Unicode encoding, and Unix platforms often use a different one from Windows (UTF-32 vs. UTF-16). I would recommend converting explicitly to UTF-8 for transfer between machines: on Windows use WideCharToMultiByte (with CP_UTF8), and on unixen use iconv() to convert between your local wstring encoding and UTF-8. (Note that on Unix it's more common to simply use UTF-8 everywhere - in that case you'd use a normal std::string with UTF-8 text in it on the Unix side.)
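For illustration, here's a minimal sketch of the Windows side of that conversion (the helper name to_utf8 is mine, not part of any API):

    #include <windows.h>
    #include <stdexcept>
    #include <string>

    // Convert a std::wstring (UTF-16 on Windows) to a UTF-8 std::string
    // ready to be written to the socket.
    std::string to_utf8(const std::wstring& w)
    {
        if (w.empty()) return std::string();

        // First call: ask how many bytes the UTF-8 result will need.
        int len = WideCharToMultiByte(CP_UTF8, 0,
                                      w.data(), static_cast<int>(w.size()),
                                      nullptr, 0, nullptr, nullptr);
        if (len <= 0) throw std::runtime_error("WideCharToMultiByte failed");

        // Second call: perform the actual conversion into the buffer.
        std::string utf8(len, '\0');
        WideCharToMultiByte(CP_UTF8, 0,
                            w.data(), static_cast<int>(w.size()),
                            &utf8[0], len, nullptr, nullptr);
        return utf8;
    }

The receiving side then reverses the conversion (MultiByteToWideChar on Windows, iconv() on Unix), or on Unix simply keeps the UTF-8 std::string as-is.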
Upvotes: 1
Reputation: 308402
The purpose of wstring is to allow use of the entire Unicode character set, which includes Chinese and Arabic and every other character set known to man. It almost completely obsoletes the concept of a code page - the characters have the same representation on a computer based in any language.
See The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) for starters.
You might run into some trouble with the transfer, as wchar_t will typically be 16 bits on Windows and 32 bits on Linux. There might also be big-endian vs. little-endian issues to worry about. The safest course of action is to transfer via UTF-8, which encodes the Unicode characters into unambiguous sequences of 8-bit bytes.
Upvotes: 3