rullof

Reputation: 7434

Why did C++11 introduce the char16_t and char32_t types

Why did the C++11 Standard introduce the types char16_t and char32_t? Isn't 1 byte enough to store characters? Is there any purpose in extending the size of character types?

Upvotes: 5

Views: 2955

Answers (3)

Sebastian Redl

Reputation: 72054

So after you've read Joel's article about Unicode, you should know about Unicode in general, but not about how it is handled in C++.

The problem with C++98 was that it didn't really know about Unicode. (Except for the universal character name escape syntax.) C++ just required the implementation to define:

- a "basic source character set" (which is essentially meaningless, because it's about the encoding of the source file, and thus comes down to telling the compiler "this is it"),
- a "basic execution character set" (some set of characters representable by narrow strings, plus an 8-bit, possibly multi-byte, encoding used to represent them at runtime, which has to include the most important characters used in C++), and
- a "wide execution character set" (a superset of the basic set, plus an encoding that uses wchar_t as its code unit, with the requirement that a single wchar_t can represent any character from the set).

Nothing about actual values in these character sets.

So what happened?

Well, Microsoft switched to Unicode very early, back when it still had less than 2^16 characters. They implemented their entire NT operating system using UCS-2, which is the fixed-width 16-bit encoding of old Unicode versions. It made perfect sense for them to define their wide execution character set to be Unicode, make wchar_t 16 bits and use UCS-2 encoding. For the basic set, they chose "whatever the current ANSI codepage is", which made zero sense, but they pretty much inherited that. And because narrow string support was considered legacy, the Windows API is full of weird restrictions on that. We'll get to that.

Unix switched a little later, when it was already clear that 16 bits weren't enough. Faced with the choice of a 16-bit variable-width encoding (UTF-16), a 32-bit fixed-width encoding (UTF-32/UCS-4), or an 8-bit variable-width encoding (UTF-8), they went with UTF-8, which also had the nice property that a lot of code written to handle ASCII and ISO-8859-* text didn't even need to be updated. For wchar_t, they chose 32 bits and UCS-4, so that they could represent every Unicode code point in a single unit.

Microsoft then upgraded everything they had to UTF-16 to handle the new Unicode characters (with some long-lingering bugs), and wchar_t remained 16 bits because of backwards compatibility. Of course that meant wchar_t could no longer represent every character from the wide set in a single unit, making the Microsoft compiler non-conformant, but nobody thought that was a big deal. It isn't as if some C++ standard APIs were totally reliant on that property. (Well, yes, codecvt is. Tough luck.)

But still, they thought UTF-16 was the way to go, and the narrow APIs remained the unloved stepchildren. UTF-8 didn't get supported. You cannot use UTF-8 with the narrow Windows API. You cannot make the Microsoft compiler use UTF-8 as the encoding for narrow string literals. They just didn't feel it was worth it.

The result: extreme pain when trying to write internationalized applications for both Unix and Windows. Unix plays well with UTF-8, Windows with UTF-16. It's ugly. And wchar_t has different meanings on different platforms.

char16_t and char32_t, as well as the new string literal prefixes u, U and u8, are an attempt to give the programmer reliable tools to work with encodings. Sure, you still have to either do weird compile-time switching for multi-platform code, or else decide on one encoding and do conversions in some wrapper layer, but at least you now have the right tools for the latter choice. Want to go the UTF-16 route? Use u and char16_t everywhere, converting to UTF-8 near system APIs as needed. Previously you couldn't do that at all in Unix environments. Want UTF-8? Use char and u8, converting near UTF-16 system APIs (and avoid the standard library I/O and string manipulation stuff, because Microsoft's version still doesn't support UTF-8). Previously you couldn't do that at all in Windows. And now you can even use UTF-32, converting everywhere, if you really want to. That, too, wasn't possible before in Windows.
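
For illustration, here is a minimal sketch (not part of the original answer) of what the new prefixes and types give you. It assumes compilation as C++11/14/17, since C++20 changes the element type of u8 literals from char to char8_t.

```cpp
#include <string>

int main() {
    std::string    utf8  = u8"text"; // UTF-8 code units stored in plain char (pre-C++20)
    std::u16string utf16 = u"text";  // char16_t code units, guaranteed UTF-16
    std::u32string utf32 = U"text";  // char32_t code units, guaranteed UTF-32
    std::wstring   wide  = L"text";  // wchar_t: 16-bit UTF-16 on Windows,
                                     // usually 32-bit UTF-32 on Unix-like systems

    // A common cross-platform strategy, as described above: pick one of the
    // fixed encodings (say UTF-8 in std::string) for everything inside the
    // program, and convert only in a thin wrapper around platform APIs that
    // expect a different encoding.
    return 0;
}
```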

So that's why these things are in C++11: to give you some tools to work around the horrible SNAFU around character encodings in cross-platform code in an at least somewhat predictable and reliable fashion.

Upvotes: 11

Remy Lebeau

Reputation: 597670

1 byte has never been enough. There are hundreds of ANSI 8-bit encodings in existence because people kept trying to stuff different languages into the confines of 8-bit limitations, so the same byte values have different meanings in different languages. Then Unicode came along to solve that problem, but it needed 16 bits to do it (UCS-2). Eventually, the needs of the world's languages exceeded 16 bits, so the UTF-8/16/32 encodings were created to extend the available values.

char16_t and char32_t (and their respective literal prefixes) were created to handle UTF-16/32 in a uniform manner on all platforms. Originally there was wchar_t, but it was created when Unicode was new, and its byte size was never standardized, even to this day. On some platforms wchar_t is 16-bit (UTF-16), whereas on other platforms it is 32-bit (UTF-32) instead. This has caused plenty of interoperability issues over the years when exchanging Unicode data across platforms. char16_t and char32_t were finally introduced to have standardized sizes (16-bit and 32-bit, respectively) and semantics on all platforms.
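
As a small illustration (my addition, not part of the answer), the difference in guarantees can be checked at compile time; the comments about typical wchar_t sizes describe common platform behaviour, not a standard requirement.

```cpp
#include <climits>  // CHAR_BIT

// The standard guarantees minimum widths for the new types on every platform.
static_assert(sizeof(char16_t) * CHAR_BIT >= 16, "char16_t has at least 16 bits");
static_assert(sizeof(char32_t) * CHAR_BIT >= 32, "char32_t has at least 32 bits");

// No such portable guarantee exists for wchar_t: it is typically 2 bytes on
// Windows (UTF-16 code units) and 4 bytes on Linux/macOS (UTF-32 code units),
// so any static_assert pinning its size would pass on one platform and fail
// on another.

int main() { return 0; }
```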

Upvotes: 8

pentadecagon

Reputation: 4847

There are around 100000 characters (they call them code points) defined in Unicode, so in order to specify any one of them, 1 byte is not enough. 1 byte is just enough to enumerate the first 256 of them, which happen to be identical to ISO-8859-1. Two bytes are enough for the most important subset of Unicode, the so-called Basic Multilingual Plane, and many applications, such as Java, settle on 16-bit characters for Unicode. If you want truly every single Unicode character, you have to go beyond that and allow 4 bytes / 32 bits. And because different people have different needs, C++ allows different sizes here. UTF-8 is a variable-size encoding rarely used within programs, because different characters have different lengths. To some extent this also applies to UTF-16, but in most cases you can safely ignore this issue with char16_t.
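
A short sketch (my addition) of that last point: a character outside the Basic Multilingual Plane, such as U+1F600, needs several code units in UTF-8 and UTF-16, but always exactly one in UTF-32. Compile as C++11/14/17 (C++20 changes the type of u8 literals).

```cpp
#include <iostream>
#include <string>

int main() {
    std::string    utf8  = u8"\U0001F600"; // an emoji outside the BMP
    std::u16string utf16 = u"\U0001F600";
    std::u32string utf32 = U"\U0001F600";

    std::cout << utf8.size()  << '\n'  // 4 bytes in UTF-8
              << utf16.size() << '\n'  // 2 char16_t units (a surrogate pair)
              << utf32.size() << '\n'; // 1 char32_t unit
    return 0;
}
```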

Upvotes: 0
