Jean Davy

Reputation: 2240

Sending a std::wstring from a "Chinese Windows" to an "Arabic Windows"

I have a socket through which I send serialized std::wstring values, for example from a Chinese version of Windows to a Unix system running with an Arabic UI.

I can't understand how my Unix system (or anything else) will know that these std::wstring values contain Chinese text. I don't see a "code page" (i.e. the country/language setting?) stored anywhere in std::wstring. Do I have to attach the code page myself to potentially every std::wstring?

I must be missing something, as this looks like a basic question in our interconnected world ...

Thanks,

Upvotes: 1

Views: 222

Answers (2)

bdonlan

Reputation: 231303

Generally speaking, a wstring holds text in one of the Unicode encodings, which are language-neutral. No matter what language settings you're using on your own computer, the content of the wstring is the same (this is one of the main advantages of Unicode!).

However, note that there is more than one Unicode encoding, and Unix platforms often use a different one from Windows for wstring (typically UTF-32, also known as UCS-4, vs. UTF-16). I would recommend converting explicitly to UTF-8 for transfer between machines: on Windows use WideCharToMultiByte (with CP_UTF8), and on Unix systems use iconv() to convert between your local wstring encoding and UTF-8. Note that on Unix it's more common to simply use UTF-8 everywhere, in which case you'd use a normal std::string holding UTF-8 text on the Unix side.
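As a rough illustration of the sending side, here is a minimal hand-rolled sketch that encodes a std::wstring as UTF-8 without any platform API. The names wide_to_utf8 and append_utf8 are made up for this example, and it assumes wchar_t holds UTF-16 code units when it is 2 bytes wide and UTF-32 code points otherwise (true for the usual Windows and Linux toolchains, but not guaranteed by the standard). In real code WideCharToMultiByte or iconv() would do the same job.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Append one Unicode code point to `out` as 1 to 4 UTF-8 bytes.
static void append_utf8(std::string& out, std::uint32_t cp) {
    assert(cp <= 0x10FFFF);  // callers only pass valid code points
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: encode a std::wstring as UTF-8.
// Assumption: 2-byte wchar_t means UTF-16, anything else means UTF-32.
std::string wide_to_utf8(const std::wstring& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = static_cast<std::uint32_t>(in[i]);
        // On UTF-16 platforms, merge a high/low surrogate pair into one code point.
        if (sizeof(wchar_t) == 2 && cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()) {
            const std::uint32_t lo = static_cast<std::uint32_t>(in[i + 1]);
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                ++i;  // the low surrogate has been consumed
            }
        }
        // Unpaired surrogates and out-of-range values become U+FFFD.
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            cp = 0xFFFD;
        }
        append_utf8(out, cp);
    }
    return out;
}
```

The resulting std::string is just bytes, so it can be written to the socket as-is (with whatever length prefix or terminator your protocol already uses), and it looks identical whether it was produced on the Windows or the Unix side.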

Upvotes: 1

Mark Ransom

Reputation: 308402

The purpose of wstring is to allow the entire Unicode character set, which includes Chinese, Arabic and the characters of practically every other writing system. It almost completely obsoletes the concept of a code page - the characters have the same representation on a computer set up for any language.

See The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) for starters.

You might run into some trouble with the transfer, as wchar_t will typically be 16 bits on Windows and 32 bits on Linux. There also might be some big-endian vs. little-endian issues to worry about. The safest course of action is to transfer via UTF-8, which encodes the Unicode characters as sequences of 8-bit bytes, so there is no byte order or character width to disagree on.
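For the receiving side, a minimal sketch of turning UTF-8 bytes back into the local wstring could look like this. The name utf8_to_wide is hypothetical, and the code again assumes a 2-byte wchar_t means UTF-16 and anything wider means UTF-32. It is deliberately simplified: malformed input becomes U+FFFD and overlong encodings are not rejected, so treat it as an illustration rather than a hardened decoder.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: decode UTF-8 bytes into the local wchar_t encoding.
// Assumption: 2-byte wchar_t means UTF-16, anything else means UTF-32.
std::wstring utf8_to_wide(const std::string& in) {
    std::wstring out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0xFFFD;  // default for invalid lead bytes
        std::size_t extra = 0;      // continuation bytes still expected
        if (lead < 0x80)                { cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        ++i;
        for (std::size_t k = 0; k < extra; ++k) {
            const bool ok = i < in.size() &&
                (static_cast<unsigned char>(in[i]) & 0xC0) == 0x80;
            if (!ok) {
                cp = 0xFFFD;  // truncated or broken sequence
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(in[i]) & 0x3F);
            ++i;
        }
        // Surrogate code points and out-of-range values are not valid in UTF-8.
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            cp = 0xFFFD;
        }
        assert(cp <= 0x10FFFF);  // invariant after the check above
        if (sizeof(wchar_t) == 2 && cp >= 0x10000) {
            // UTF-16 platforms need a surrogate pair above the BMP.
            cp -= 0x10000;
            out += static_cast<wchar_t>(0xD800 + (cp >> 10));
            out += static_cast<wchar_t>(0xDC00 + (cp & 0x3FF));
        } else {
            out += static_cast<wchar_t>(cp);
        }
    }
    return out;
}
```

Because the decoder only ever looks at one byte at a time, neither the sender's wchar_t width nor its endianness matters; the same bytes produce the same characters on both machines.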

Upvotes: 3
