Robinson

Reputation: 10132

Convert UTF-16 (wchar_t on Windows) to UTF-32

I have a string of characters given to me by a Windows API function (GetLocaleInfoEx with LOCALE_SLONGDATE) as wchar_t. Is it correct to say that the value returned from Windows will be UTF-16, and that therefore one wchar_t does not necessarily correspond to one "printable character"?

To make writing my parser easier, is there a function I can use to convert from UTF-16 to UTF-32, where I'll be guaranteed (I assume), one array element represents one character?

Upvotes: 2

Views: 1462

Answers (2)

Harry Johnston

Reputation: 36348

Looking at the documentation for LOCALE_SLONGDATE it is stated that any characters other than the format pictures must be enclosed in single quotes. So in this particular case converting to UTF-32 should indeed solve your problem (but see proviso below).

By the same token, though, you don't need to. The only UTF-16 code units that don't represent a complete code point on their own are the surrogates (0xD800 through 0xDFFF), none of which can be mistaken for a single quote (U+0027). So to separate out the format pictures from the surrounding text, you just need to scan the UTF-16 string for single quotes. (The same is even true of UTF-8; the only byte that looks like a single quote is a single quote.)

Any surrogate pairs, combining characters, or other complications should always be safely tucked away inside the substrings thus delimited. Provided you never attempt to subdivide the substrings themselves, you should be safe.
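The scan described above could look roughly like the following. This is only a minimal sketch: `Segment` and `split_on_quotes` are names made up for illustration, `char16_t` stands in for Windows' 16-bit `wchar_t` so the code is portable, and it assumes quotes are simple on/off delimiters (no handling of an escaped or doubled quote inside a literal).

```cpp
#include <string>
#include <vector>

// Hypothetical helper type: one run of text between single quotes.
struct Segment {
    bool quoted;          // true if the text was enclosed in single quotes
    std::u16string text;  // the code units of the run, quotes removed
};

// Split a UTF-16 string on U+0027 by comparing code units only.
// Surrogates (0xD800..0xDFFF) can never equal 0x0027, so pairs stay
// intact inside whichever segment they fall in.
std::vector<Segment> split_on_quotes(const std::u16string& s) {
    std::vector<Segment> out;
    std::u16string cur;
    bool in_quotes = false;
    for (char16_t c : s) {
        if (c == u'\'') {
            if (!cur.empty()) out.push_back({in_quotes, cur});
            cur.clear();
            in_quotes = !in_quotes;  // assumption: no escaped quotes
        } else {
            cur.push_back(c);
        }
    }
    if (!cur.empty()) out.push_back({in_quotes, cur});
    return out;
}
```

The unquoted segments are then the format pictures, and the quoted ones are literal text that can be copied through without looking inside it.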


Proviso: the documentation does not indicate whether it is permissible to combine a single quote mark with a combining character in a locale, and if so, how it will be interpreted. I interpret that as meaning that such a combination is not allowed. In any case, it seems unlikely that Windows itself would go to the trouble of dealing with such an unnecessary complication. So it should be safe enough to ignore this case too, but YMMV.

Upvotes: 1

Nicol Bolas

Reputation: 474436

> where I'll be guaranteed (I assume), one array element represents one character?

That's not how Unicode works. One codepoint (an array element in UTF-32) does not necessarily map to a single visible character. Multiple codepoints can combine to form one visible character, for example a base letter followed by a combining accent.
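To make that concrete, here is a rough sketch of a UTF-16 to UTF-32 decoder. `utf16_to_utf32` is a made-up name, not a Windows or standard library function, and replacing unpaired surrogates with U+FFFD is my own choice for the sketch. `char16_t` is used in place of Windows' 16-bit `wchar_t`.

```cpp
#include <cstddef>
#include <string>

// Decode UTF-16 code units into UTF-32 codepoints.
// Assumption: an unpaired surrogate becomes U+FFFD (replacement character).
std::u32string utf16_to_utf32(const std::u16string& in) {
    std::u32string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t c = in[i];
        // High surrogate followed by a low surrogate: combine the pair.
        if (c >= 0xD800 && c <= 0xDBFF && i + 1 < in.size()) {
            char32_t lo = in[i + 1];
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                out.push_back(0x10000 + ((c - 0xD800) << 10) + (lo - 0xDC00));
                ++i;  // consume the low surrogate as well
                continue;
            }
        }
        // Any surrogate reaching this point is unpaired.
        if (c >= 0xD800 && c <= 0xDFFF) c = 0xFFFD;
        out.push_back(c);
    }
    return out;
}
```

Decoding `u"e\u0301"` (a letter e followed by a combining acute accent) with this gives two UTF-32 elements, even though a reader sees a single accented letter. So the conversion removes surrogate pairs, but it does not give you one element per visible character.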

You have to do genuine Unicode analysis if you want to be able to know how many visible characters a Unicode string has.

Even with dates (particularly long-form dates as you asked for), you are not safe from such features. The locale can return arbitrary Unicode strings, so you have no way to know from just the number of codepoints how many visible characters a Unicode string contains.

Upvotes: 3
