MarshallBS

Reputation: 96

Do C++ string objects handle variable-width UTF encodings?

The reference material I've read isn't clear at all. Say I'm using a std::wstring object, my_string. If I refer to the "character" at my_string[23], is that actually the UTF-16 code point at index 23 in the string? Or is it just whatever data sits at an offset of 23*2 bytes from the start, even when some of the code points are outside the Basic Multilingual Plane and need 32 bits?

It's nice working in Python or Java, where the details are all taken care of. The details can also be mostly ignored when compiling on UNIX, since wchar_t is typically 32 bits there and a wide string never needs a variable-width encoding. But when I take C++ code I've written home and test it on a Windows laptop, things get confusing.

It doesn't help that nobody uses precise language. From my understanding there are three different kinds of "text data objects" that are often ambiguously referred to as "characters". There are fixed-size "data points", then there are Unicode "code points" that may have to be encoded with 2 or more data points (1-4 uint8 in UTF-8, 1-2 uint16 in UTF-16), then there are "actual characters" that can sometimes consist of pairs of code points.

Upvotes: 1

Views: 122

Answers (1)

Anders

Reputation: 101636

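It is the naive element offset: my_string[23] is whatever wchar_t sits 23 elements (23*sizeof(wchar_t) bytes) from the start. On Windows (with all C libraries I know of) wchar_t is 16 bits and holds UTF-16, so what you get back is a UTF-16 code unit, which is just a fancy way of saying one wchar_t. That may be a whole code point or one half of a surrogate pair.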
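To make the indexing behaviour concrete, here is a minimal sketch. It uses std::u16string instead of std::wstring so that the element size is 16 bits on every platform, and the names sample_text and is_surrogate are made up for the illustration.

```cpp
#include <string>

// Hypothetical sample: 'a', U+1F600 (outside the BMP), 'b'.
// In UTF-16 the middle code point is stored as the surrogate pair D83D DE00,
// so the string holds 4 code units for 3 code points.
inline std::u16string sample_text() { return u"a\U0001F600b"; }

// True if `u` is one half of a surrogate pair rather than a whole code point.
constexpr bool is_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDFFF; }
```

With this string, sample_text()[1] is the high surrogate 0xD83D and not the code point U+1F600, and 'b' ends up at index 3 rather than index 2.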

Windows NT 3 & 4 were just UCS-2, so every wchar_t was a whole code point and code points were all you had to deal with. In Windows 2000 and later the native string changed to UTF-16, and surrogate pairs came in and "ruined the day".

Your understanding is correct: a grapheme cluster (what your average human calls a character) is made up of a base code point and optionally one or several code points for combining marks. A code point is in turn encoded as one or several code units (1 or 2 for UTF-16, 1-4 for UTF-8).
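If you need code points rather than code units, you have to decode them yourself or use a library. A rough sketch of what that looks like for UTF-16 follows; decode_utf16 and utf8_length are hypothetical helper names, and passing unpaired surrogates through unchanged is my own assumption about error handling, not something the standard library does for you. UTF-8 as currently defined (RFC 3629) needs at most 4 bytes per code point.

```cpp
#include <cstddef>
#include <string>

// Decode UTF-16 code units into code points (one char32_t each).
// Assumption: an unpaired surrogate is copied through as-is instead of
// being reported as an error.
inline std::u32string decode_utf16(const std::u16string& in) {
    std::u32string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t u = in[i];
        const bool high = u >= 0xD800 && u <= 0xDBFF;
        if (high && i + 1 < in.size()) {
            const char32_t lo = in[i + 1];
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                // Combine the two 10-bit halves and add the 0x10000 bias.
                u = 0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00);
                ++i;  // the low surrogate has been consumed
            }
        }
        out.push_back(u);
    }
    return out;
}

// Number of UTF-8 code units (bytes) needed for one code point: 1 to 4.
constexpr int utf8_length(char32_t cp) {
    return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}
```

After decoding, index n of the result really is the code point at position n, at the cost of an extra copy and 4 bytes per element.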

This means that even if you are working on a platform where wchar_t is UCS-4/UTF-32, you cannot just cut or split a string at arbitrary positions, because you might end up chopping the combining marks away from the base character. The problem is even more destructive for a script like Hangul (Korean), because syllables are written in blocks of 2 or 3 parts (jamo).
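As a rough illustration of the splitting problem, the sketch below nudges a split position forward so it does not land between a base character and its combining marks. It is deliberately simplified: is_combining_mark only knows the Combining Diacritical Marks block (U+0300 to U+036F), and both function names are made up. Real grapheme segmentation follows Unicode's text segmentation rules (UAX #29) and is better left to a library such as ICU.

```cpp
#include <cstddef>
#include <string>

// Simplifying assumption: only U+0300..U+036F count as combining marks.
constexpr bool is_combining_mark(char32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

// Return the first position >= pos at which a UTF-32 string can be cut
// without separating a combining mark from the character before it.
inline std::size_t next_safe_split(const std::u32string& s, std::size_t pos) {
    if (pos > s.size()) pos = s.size();
    // Position 0 is always safe; otherwise skip marks attached to s[pos - 1].
    while (pos > 0 && pos < s.size() && is_combining_mark(s[pos])) ++pos;
    return pos;
}
```

For the three code points 'e', U+0301 (combining acute accent), 'x', asking for a split at 1 gives 2, so the accent stays with the 'e'.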

Windows provides functions like CharNextW to help you walk strings by skipping over "attached" combining marks. The newer std::u16string does not use it: it is just std::basic_string<char16_t>, so its operator[] indexes code units and is just as naive.

Upvotes: 1
