Hunter

Reputation: 151

wchar_t and encoding

If I want to convert a string to UTF-16, say char * xmlbuffer, do I have to convert the type to wchar_t * before encoding to UTF-16? And is the char * type required before encoding to UTF-8?

How are wchar_t and char related to UTF-8, UTF-16, UTF-32, or other transformation formats?

Thanks in advance for help!

Upvotes: 6

Views: 8852

Answers (3)

dreamlax

Reputation: 95335

iconv is a POSIX function that can take care of the intermediate encoding step. You can use iconv_open to specify that you have UTF-8 input and that you want UTF-16 output. Then, using the handle returned from iconv_open, you can use iconv (specifying your input buffer and output buffer). When you are done you must call iconv_close on the handle returned from iconv_open to free resources etc.

You will have to check your system's documentation to see which encodings iconv supports and how it names them (i.e. what to pass to iconv_open). For example, iconv on some systems expects "utf-8", while on others it may expect "UTF8", etc.
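The iconv_open / iconv / iconv_close sequence described above can be sketched roughly as follows. This is only a minimal sketch, not a definitive implementation: the helper name utf8_to_utf16le is hypothetical, the encoding names "UTF-8" and "UTF-16LE" are assumptions that work with glibc and GNU libiconv, and some platforms declare the input buffer parameter differently or need an extra -liconv at link time.

```cpp
#include <iconv.h>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of any library): converts UTF-8 bytes to
// UTF-16LE bytes using iconv. The result is returned as raw bytes in a
// std::string, so no wchar_t is involved at any point.
std::string utf8_to_utf16le(const std::string& in) {
    // Argument order is (to, from). The names are assumptions; check what
    // your system's iconv accepts.
    iconv_t cd = iconv_open("UTF-16LE", "UTF-8");
    if (cd == (iconv_t)-1) {
        throw std::runtime_error("iconv_open failed");
    }

    // A UTF-8 sequence of 1-3 bytes becomes one 2-byte code unit and a
    // 4-byte sequence becomes a 4-byte surrogate pair, so two output bytes
    // per input byte is always enough.
    std::string out(in.size() * 2, '\0');

    // POSIX declares inbuf as char**, hence the const_cast; iconv does not
    // modify the input bytes themselves.
    char* src = const_cast<char*>(in.data());
    std::size_t src_left = in.size();
    char* dst = &out[0];
    std::size_t dst_left = out.size();

    std::size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    iconv_close(cd);  // always release the conversion descriptor
    if (rc == (std::size_t)-1) {
        throw std::runtime_error("iconv failed (invalid or incomplete input?)");
    }

    out.resize(out.size() - dst_left);  // keep only the bytes actually written
    return out;
}
```

Calling utf8_to_utf16le("A") should yield the two bytes 0x41 0x00. Naming the byte order explicitly ("UTF-16LE") is a deliberate choice here: with plain "UTF-16", some implementations prepend a byte order mark.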

Windows does not provide a version of iconv; instead it provides its own UTF conversion functions: MultiByteToWideChar and WideCharToMultiByte.

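//UTF8 to UTF16 (needs <windows.h> and <string>)
std::string input = ...
//first call: ask how many wchar_t units the result needs
int utf16len = MultiByteToWideChar(CP_UTF8, 0, input.c_str(), (int)input.size(),
                                   NULL, 0);
//std::wstring has no size-only constructor, so pass a fill character
std::wstring output(utf16len, L'\0');
//second call: perform the conversion into the sized buffer
MultiByteToWideChar(CP_UTF8, 0, input.c_str(), (int)input.size(),
                    &output[0], (int)output.size());

//UTF16 to UTF8 (a separate snippet; the variable names are reused)
std::wstring input = ...
//first call: ask how many bytes the result needs
int utf8len = WideCharToMultiByte(CP_UTF8, 0, input.c_str(), (int)input.size(),
                                  NULL, 0, NULL, NULL);
std::string output(utf8len, '\0');
//second call: perform the conversion into the sized buffer
WideCharToMultiByte(CP_UTF8, 0, input.c_str(), (int)input.size(),
                    &output[0], (int)output.size(), NULL, NULL);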

Upvotes: 5

Jon

Reputation: 437376

No, you don't have to change data types.

About wchar_t: the standard says that

Type wchar_t is a distinct type whose values can represent distinct codes for all members of the largest extended character set specified among the supported locales.

Unfortunately, it does not say what encoding wchar_t is supposed to have; this is implementation-dependent. So for example given

auto s = L"foo";

you can make absolutely no assumption about what the value of the expression *s is.

However, you can use a std::string as an opaque sequence of bytes that represents text in any transformation format of your choice without issue. Just don't perform standard library string operations on it that assume one char is one character, because they work on bytes (code units), not on characters.
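To illustrate why byte-oriented string operations can mislead on encoded text, here is a small sketch. The helper utf8_code_points is hypothetical and assumes its input is already valid UTF-8; it is only meant to show that size() reports bytes rather than characters.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts Unicode code points in a std::string that is
// assumed to hold valid UTF-8. Continuation bytes have the bit pattern
// 10xxxxxx, so every byte that does not match it starts a new code point.
std::size_t utf8_code_points(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the UTF-8 string "caf\xC3\xA9" ("café"), size() returns 5 because the last character occupies two bytes, while utf8_code_points returns 4.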

Upvotes: 5

damienh

Reputation: 164

The size of wchar_t is compiler-dependent (for example, 16 bits with MSVC on Windows and 32 bits with GCC on Linux), so its relation to the various Unicode formats will vary.
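A quick way to see what your own compiler does is to inspect the width directly. The sketch below uses hypothetical helper names and assumes a C++11 compiler; it also contrasts wchar_t with char16_t and char32_t, whose minimum widths are fixed by the standard.

```cpp
#include <climits>
#include <cstddef>

// Hypothetical helper: width of wchar_t in bits for the current compiler.
constexpr std::size_t wchar_t_bits() {
    return sizeof(wchar_t) * CHAR_BIT;
}

// Hypothetical helper: true when one wchar_t can hold any Unicode code point.
// The highest code point is U+10FFFF, which needs 21 bits.
constexpr bool wchar_t_holds_any_code_point() {
    return wchar_t_bits() >= 21;
}

// Unlike wchar_t, the C++11 types char16_t and char32_t are guaranteed to be
// at least 16 and 32 bits wide, so they map directly onto UTF-16 and UTF-32
// code units.
static_assert(sizeof(char16_t) * CHAR_BIT >= 16, "char16_t holds a UTF-16 code unit");
static_assert(sizeof(char32_t) * CHAR_BIT >= 32, "char32_t holds a UTF-32 code unit");
```

Where wchar_t_holds_any_code_point() is false, a wide string typically stores UTF-16, and characters outside the Basic Multilingual Plane take two wchar_t units.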

Upvotes: 1
