Do C++ string objects handle variable-width UTF encodings?

Question

The reference material I've read isn't clear on this at all. Say I'm using a std::wstring object, my_string. If I refer to the "character" at my_string[23], is that actually the UTF-16 code point at index 23 in the string? Or is it just whatever data sits at an offset of 23*2 bytes from the start, even when some of the code points are in the supplementary planes and require 32 bits (two 16-bit units)?
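To make the indexing question concrete, here is a minimal sketch. It uses std::u16string rather than std::wstring so that the element width is 16 bits on every platform; the helper names (code_unit_at, is_surrogate, sample) are made up for illustration and are not part of any library. It suggests that operator[] simply returns the stored 16-bit unit at that index, with no decoding.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: returns the raw 16-bit unit stored at index i.
// This is exactly what operator[] hands back; nothing is decoded.
char16_t code_unit_at(const std::u16string& s, std::size_t i) {
    assert(i < s.size());
    return s[i];
}

// True if the unit is one half of a surrogate pair (0xD800..0xDFFF),
// i.e. not a complete code point on its own.
bool is_surrogate(char16_t u) {
    return u >= 0xD800 && u <= 0xDFFF;
}

// "a", U+1F600 (outside the Basic Multilingual Plane), "b":
// three code points, but four 16-bit units in UTF-16.
std::u16string sample() {
    return u"a\U0001F600b";
}
```

With this sample, size() reports 4 rather than 3, indices 1 and 2 hold the two surrogate halves of U+1F600, and the "b" that is the third code point lives at index 3.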

It's nice working in Python or Java, where the details are all taken care of. The details can also be mostly ignored when compiling on most UNIX-like systems, since wchar_t is 32 bits wide there and every code point fits in a single element. But when I take C++ code I've written home and test it on a Windows laptop, where wchar_t is 16 bits, things get confusing.

It doesn't help that nobody uses precise language. As I understand it, there are three different kinds of "text data objects" that are often ambiguously referred to as "characters". There are fixed-size "data points" (usually called code units), then there are Unicode "code points", which may have to be encoded with 2 or more code units (1-4 uint8 in UTF-8, 1-2 uint16 in UTF-16), and then there are "actual characters", which can sometimes consist of several code points.
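The three levels described above can be told apart with a small sketch. The function names below are hypothetical, and both functions assume well-formed input. They count code points by skipping the units that only continue a previous code point: low surrogates in UTF-16, and bytes of the form 10xxxxxx in UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in well-formed UTF-16.
// A low (trailing) surrogate, 0xDC00..0xDFFF, never starts a code
// point, so every other unit marks the start of one.
std::size_t count_code_points(const std::u16string& s) {
    std::size_t n = 0;
    for (char16_t u : s) {
        if (u < 0xDC00 || u > 0xDFFF) ++n;
    }
    return n;
}

// Hypothetical helper: counts code points in well-formed UTF-8.
// Continuation bytes have the bit pattern 10xxxxxx and are skipped.
std::size_t count_code_points_utf8(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char b : s) {
        if ((b & 0xC0) != 0x80) ++n;
    }
    return n;
}
```

For u"a\U0001F600b", size() is 4 code units while count_code_points gives 3 code points. For u"e\u0301" (the letter e followed by a combining acute accent), both give 2, even though a reader sees a single character. Grouping code points into such user-perceived characters is not something the C++ standard library does and generally calls for a dedicated Unicode library.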

