kverkagambo

Reputation: 49

How do I correctly initialize a wide character string?

I am trying to figure out wide characters in C. For example, I am testing a string that contains the single letter "Ē", which is encoded as 0xC4 0x92 in UTF-8.

char* T1 = "Ē";
//This is the resulting array { 0xc4, 0x92, 0x00 }

wchar_t* T2 = L"Ē";
//This is the resulting array { 0x00c4, 0x2019, 0x0000 }

I expected the second array to be { 0xc492, 0x0000 }; instead it contains an extra character that, in my opinion, just wastes space. Can anyone help me understand what is going on here?
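For reference, here is a minimal sketch of how the contents of both strings can be dumped (assuming only <stdio.h> and <wchar.h>; the values it prints depend on how the compiler interprets the source file):

#include <stdio.h>
#include <wchar.h>

int main(void) {
    const char *T1 = "Ē";
    const wchar_t *T2 = L"Ē";

    /* Dump each byte of the narrow string. */
    for (const char *p = T1; *p != '\0'; ++p)
        printf("0x%02x ", (unsigned char)*p);
    printf("\n");

    /* Dump each code unit of the wide string. */
    for (const wchar_t *p = T2; *p != L'\0'; ++p)
        printf("0x%04x ", (unsigned int)*p);
    printf("\n");
    return 0;
}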

Upvotes: 2

Views: 1753

Answers (2)

What you've managed to do here is mojibake. Your source code is written in UTF-8 but it was interpreted in Windows codepage 1252 (i.e. the compiler source character set was CP1252).

The wide string contains the two Windows codepage 1252 characters corresponding to the UTF-8 bytes 0xC4 0x92, converted to UCS-2. The easiest way out is to just use an escape instead:

wchar_t* T2 = L"\x112";

or

wchar_t* T2 = L"\u0112";
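For example, a minimal sketch of the escape in use (assuming a hosted environment and a locale/terminal that can display the character; setlocale and wprintf are standard C):

#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main(void) {
    setlocale(LC_ALL, "");          /* use the environment's locale for wide output */
    const wchar_t *T2 = L"\u0112";  /* U+0112 "Ē", independent of the source character set */
    wprintf(L"%ls\n", T2);          /* prints Ē if the locale/terminal can represent it */
    return 0;
}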

The larger problem is that, to my knowledge, neither C nor C++ has a mechanism for specifying the source character set within the code itself, so it is always a compiler setting or option external to the code, not something you can carry along with a copy-paste.

Upvotes: 5

Michael Karcher

Reputation: 4111

Your compiler is misinterpreting your source code file (which is saved as UTF-8) as Windows-1252 (commonly called ANSI). It does not interpret the byte sequence C4 92 as the one-character UTF-8 string "Ē", but as the two-character Windows-1252 string "Ä’". The Unicode code point of "Ä" is U+00C4, and the Unicode code point of "’" is U+2019. This is exactly what you see in your wide character string.

The 8-bit string only works because the misinterpretation does not matter there: the string is not converted during compilation. The compiler reads the string as Windows-1252 and emits it as Windows-1252 (so it does not need to convert anything, and considers both to be "Ä’"). You interpret the source code and the data in the binary as UTF-8, so you consider both to be "Ē".

To have the compiler treat your source code as UTF-8, use the switch /utf-8.

BTW: The correct UTF-16 encoding (which is the encoding MSVC uses for wide character strings) is not {0xc492, 0x0000}, but {0x0112, 0x0000}, because "Ē" is U+0112.
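To illustrate (a minimal sketch, assuming the source file is saved as UTF-8 and compiled with /utf-8, with 16-bit wchar_t as on MSVC):

#include <assert.h>
#include <wchar.h>

int main(void) {
    /* With the source correctly interpreted as UTF-8, L"Ē" and L"\u0112"
       are the same string: { 0x0112, 0x0000 }. */
    const wchar_t *T2 = L"Ē";
    assert(T2[0] == 0x0112);
    assert(T2[1] == 0x0000);
    assert(wcscmp(T2, L"\u0112") == 0);
    return 0;
}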

Upvotes: 4
