Situation of char8_t / UTF8 chars pre-C++17 and poor-man-ing it?

Question

I've been reading links such as this question and, of course, this question on preparing for the upcoming "utf8" char type char8_t and its corresponding string type in C++20, and can say, up to a point, that it's about time. Also that it's a mess.

Feel free to correct me where I'm wrong:

C++, in any Standard, has no means to specify that the source code has a given text encoding (something like Python's # encoding:... metadata), nor which Standard it should be compiled as (like, say, #!/bin/env g++ -std=c++14).
Up until C++11, there was also no way to specify that any given string literal would have a given encoding - the compiler was free to reparse a UTF8 string literal into say UTF16 or even EBCDIC if it so desired.
C++11 introduces u"text" and U"text" and the associated char types (char16_t, char32_t) to produce UTF16 and UTF32-encoded text, but does not provide string or stream facilities to work with them, so they're basically useless.
C++11 also introduces u8"text" for producing a UTF8-encoded string... but does not even introduce either a proper UTF8 char type or string type (that's what char8_t is intended to be in C++20?), so it's even more useless than the above.
Because of all this, when char8_t is finally introduced, it kills lots of code that was intended to be valid and so far some of the remediations sought include disabling char8_t behaviour altogether.
Even then, there's no readily available tooling (as in: not the same crap tier interface as ) to check, transform (within the same string) or convert (copying across string types) text encodings in C++. Even codecvt seems to have been dropped.

Given all of the above, I have some questions about why we are in this weird state and whether it will ever get better. Historically, Unicode support has been one of the lowest points of C++.

Similarly, I am wondering how useful a poor-man's emulation of the whole concept would be (disclaimer: I am the maintainer of cxxomfort, so I already backport lots of things. Work needs: the latest MSVC target at the office is MSVC 2012).

Why did C++ not add char8_t at the proper time when u8"text" was introduced or otherwise delay introduction of u8?
Alternatively, why wasn't another, non-breaking prefix like c8"text" introduced with char8_t in C++20 instead of introducing a wide-scope breaking change? I thought TPTB hated breaking changes, even more something that literally breaks the simplest possible case: cout<< prefix"hello world".
Is char8_t intended to functionally be (closer to) an alias of unsigned char or of char?
If the former, is working one's way up to, e.g., typedef std::basic_string<unsigned char> u8string a viable emulation strategy? Are there backport / reference implementations available that one can look into before writing one's own?
What's the closest we have in C++17-or-below to marking text as (intended to be) UTF-8 *for storage only*?

re: char8_t as unsigned char, this is more or less what I'm looking at in terms of pseudocode:

// this is here basically only for type-distinctiveness
class char8_t {
  unsigned char value;

  public:
  non_explicit constexpr char8_t (unsigned char ch = 0x00) noexcept;
  operator unsigned char () const noexcept;
  // implement all operators to mirror operations on unsigned char
};

// public adapter jic
friend unsigned char to_char (char8_t);

// note we're *not* using our new char-type here
namespace std {
  typedef std::basic_string<unsigned char> u8string;
}

// unsure if these two would actually be needed
// (couldn't make a compelling case so far,
// even testing with Windows's broken conhost)

namespace std {
  basic_istream u8cin;
  basic_ostream u8cout;
}

// we work up operator<<, operator>> and string conversion from there
// adding utf8-validity checks where needed

std::ostream& operator<< (std::ostream&, std::u8string const&);
std::istream& operator>> (std::istream&, std::u8string&);

// likely a macro; we'll see
#define u8c(ch) static_cast<char8_t>(ch)
// char8_t ch = u8c('x');

// very likely not a macro pre-C++20; can't skip utf-8 validity check on [2]?
u8string u8s (char8_t const* str); // [1], likely trivial
u8string u8s (char const* str);    // [2], non-trivial
// C++20 and up
#define u8s(str) u8##str // or something; not sure

// end result:

// no, I can't even think how would one spell this:
u8string text = u8s("H€łlo Ẅørλd");
// this wouldn't work without refactoring u8string into a full specialization, 
// to add the required constructor, but doing so is a PITA because 
// the basic_string interface is YAIM (yet another infamous mess):
u8string text = u8"H€łlo Ẅørλd";

I've tagged this as C++ in general, but this is more about (the value of) an implementation for Standards pre-C++20. More importantly, I'm not looking for "perfect" solutions or justifications; given the context, poor-man's is more than good enough.
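For the "utf8-validity checks" mentioned in the pseudocode comments, a minimal sketch of such a check could look like the following. The name is_valid_utf8 is hypothetical (it is not part of cxxomfort or any standard library), and the sketch works on plain std::string so that it does not depend on char8_t being available.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if `s` is well-formed UTF-8.
// Rejects overlong forms, UTF-16 surrogates (U+D800..U+DFFF) and
// code points above U+10FFFF.
inline bool is_valid_utf8(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t extra;     // number of continuation bytes expected
        unsigned long cp;      // code point decoded so far
        unsigned long min_cp;  // smallest code point allowed for this length
        if (lead < 0x80)      { extra = 0; cp = lead;        min_cp = 0x0; }
        else if (lead < 0xC0) { return false; }  // stray continuation byte
        else if (lead < 0xE0) { extra = 1; cp = lead & 0x1F; min_cp = 0x80; }
        else if (lead < 0xF0) { extra = 2; cp = lead & 0x0F; min_cp = 0x800; }
        else if (lead < 0xF8) { extra = 3; cp = lead & 0x07; min_cp = 0x10000; }
        else                  { return false; }  // 0xF8..0xFF never occur in UTF-8
        if (n - i - 1 < extra) return false;     // truncated sequence
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min_cp) return false;                   // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
        if (cp > 0x10FFFF) return false;                 // outside Unicode range
        i += extra + 1;
    }
    return true;
}
```

A u8s(char const*) overload like [2] above could call such a check before copying, and either throw or return an empty string on failure; which of those is preferable is a design choice this sketch leaves open.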

Tom Honermann · Accepted Answer

I'm the author of the P0482 and P1423 char8_t papers.

Also that it's a mess.

I completely agree. SG16 is working to improve all things Unicode and text related, but we're having to start near ground level, so it is going to take a while.

If you haven't seen it yet, the repository linked below provides some utilities for writing code that will work in C++17 and C++20.

https://github.com/tahonermann/char8_t-remediation

C++, in any Standard, has no means to specify that the source code has a given text encoding (something like Python's # encoding:... metadata), nor which Standard it should be compiled as (like, say, #!/bin/env g++ -std=c++14).

This is correct, but not without precedent. IBM's xlC compiler supports a #pragma filetag directive that behaves similarly to Python's encoding declaration. I started on a paper exploring this space and had hoped to submit it for the Prague meeting, but did not complete it in time. I expect to submit it for the Varna meeting (in June).

Up until C++11, there was also no way to specify that any given string literal would have a given encoding - the compiler was free to reparse a UTF8 string literal into say UTF16 or even EBCDIC if it so desired.

Correct, and this technically remained true for char16_t and char32_t string literals until C++20 and the adoption of P1041. Note though that there is no reparsing going on. In translation phase 1, the source code contents are converted to the compiler's internal encoding and then in translation phase 5, character and string literals are converted to the encoding of the appropriate execution character set.

C++11 introduces u"text" and U"text" and the associated char types (char16_t, char32_t) to produce UTF16 and UTF32-encoded text, but does not provide string or stream facilities to work with them, so they're basically useless.

Correct. P1629 is one of the more significant changes we're hoping to complete for C++23. The goal is to provide text encoders, decoders, and transcoders that facilitate working with text at the code unit and code point levels. We would also provide support for enumerating grapheme clusters.

C++11 also introduces u8"text" for producing a UTF8-encoded string... but does not even introduce either a proper UTF8 char type or string type (that's what char8_t is intended to be in C++20?), so it's even more useless than the above.

Correct. The goal for C++20 was to 1) enable differentiating "text" and u8"text" in the type system, 2) enable separating locale dependent and UTF-8 text (with enforcement from the type system), 3) ensure use of an unsigned type for UTF-8 code units, and 4) avoid the char type aliasing penalty. That was all we had time to get done for C++20 (standardization is not a rapid process).
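As a rough illustration of goal 1, the sketch below inspects the code unit type of a u8 string literal and uses the __cpp_char8_t feature-test macro to compile in both C++17 and C++20 modes. The names u8_unit and kind are hypothetical and exist only for this example.

```cpp
#include <string>
#include <type_traits>

// Code unit type of a u8 string literal (reference, array extent and const stripped).
typedef std::remove_cv<
    std::remove_extent<
        std::remove_reference<decltype(u8"text")>::type>::type>::type u8_unit;

#if defined(__cpp_char8_t)
// C++20, or an earlier mode with char8_t enabled: a distinct type.
static_assert(std::is_same<u8_unit, char8_t>::value, "u8 literals use char8_t");
#else
// C++11 through C++17: indistinguishable from ordinary literals.
static_assert(std::is_same<u8_unit, char>::value, "u8 literals use char");
#endif

// Overload resolution can only separate the two kinds of text
// when the types differ.
inline std::string kind(const char*) { return "locale-dependent"; }
#if defined(__cpp_char8_t)
inline std::string kind(const char8_t*) { return "UTF-8"; }
#endif
```

With char8_t enabled, kind(u8"text") selects the second overload; without it, both kind("text") and kind(u8"text") land on the first, which is the ambiguity the new type was meant to remove.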

Because of all this, when char8_t is finally introduced, it kills lots of code that was intended to be valid and so far some of the remediations sought include disabling char8_t behaviour altogether.

Correct, char8_t was proposed as a breaking change; something not to be taken lightly. In this case, it was deemed acceptable because 1) code searches found little use of u8 character and string literals, 2) the options for addressing backward compatibility concerns as discussed in P1423 were considered adequate, and 3) a non-breaking proposal would have added long term baggage to the language for little gain.

Even then, there's no readily available tooling (as in: not the same crap tier interface as ) to check, transform (within the same string) or convert (copying across string types) text encodings in C++. Even codecvt seems to have been dropped.

Correct. We'll be working to improve this situation, but it will take time. codecvt has not been dropped (yet); the <codecvt> header and various UTF converters were deprecated in C++17. std::codecvt suffers from performance and usability issues, so it is not considered something we can continue to build on. We believe P1629 is a superior direction.

Why did C++ not add char8_t at the proper time when u8"text" was introduced or otherwise delay introduction of u8?

I asked one of the C++ committee members who was involved in that original effort. He told me that he asked the people working on Unicode at the time if a new type should be added and the response was, "eh, we don't need it".

Alternatively, why wasn't another, non-breaking prefix like c8"text" introduced with char8_t in C++20 instead of introducing a wide-scope breaking change? I thought TPTB hated breaking changes, even more something that literally breaks the simplest possible case: cout<< prefix"hello world".

A different prefix was considered and at one point I briefly favored that approach. However, as mentioned earlier, that would have left us with two ways of spelling UTF-8 literals and related historical baggage. In the long run, it was felt that a breaking change, so long as we had reasonable means to mitigate the breakage, offered more benefits.

With regard to that simple test case, take a minute to think about what that code should do. Then go read this question: "What is the printf() formatting character for char8_t *?"
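One possible reading of "what that code should do" is "write the UTF-8 bytes unchanged". A minimal sketch of that choice, which compiles in both C++17 and C++20 modes, is shown here. The helper names as_char and hello_bytes are hypothetical and are not taken from the papers or the repository mentioned above.

```cpp
#include <sstream>
#include <string>

// const char* p = u8"hello world";   // fine in C++11..17, ill-formed in C++20
// std::cout << u8"hello world";      // C++20: ill-formed, the overload is deleted

// View UTF-8 code units as plain char so that char-based APIs accept them.
// Reading an object through a char pointer is permitted by the aliasing rules.
#if defined(__cpp_char8_t)
inline const char* as_char(const char8_t* s) {
    return reinterpret_cast<const char*>(s);
}
#endif
inline const char* as_char(const char* s) { return s; }

// Streams the bytes as-is; no transcoding to the locale encoding is attempted.
inline std::string hello_bytes() {
    std::ostringstream out;
    out << as_char(u8"hello world");
    return out.str();
}
```

Whether the terminal then displays those bytes correctly depends on the environment's encoding, which is exactly the question the stream cannot answer by itself.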

Is char8_t intended to functionally be (closer to) an alias of unsigned char or of char?

char8_t is intentionally and explicitly not an alias (because that has negative performance implications) but is specified to have the same underlying representation as unsigned char. The reason for unsigned char over char is to avoid expressions like u8'\x80' < 0 ever evaluating to true (which may or may not be the case with char today).
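The signedness point can be checked with a small sketch. The function name lead_byte_is_negative is hypothetical; 0xC3 is used because it is the first UTF-8 byte of U+00E9, and the conversion to a signed char is assumed to wrap in the usual two's complement way.

```cpp
#include <type_traits>

#if defined(__cpp_char8_t)
static_assert(std::is_unsigned<char8_t>::value, "char8_t is unsigned");
static_assert(sizeof(char8_t) == sizeof(unsigned char), "same size as unsigned char");
static_assert(!std::is_same<char8_t, unsigned char>::value, "distinct type, not an alias");
#endif

// Stores a UTF-8 lead byte with the high bit set in the given code unit
// type and reports whether it compares as negative.
template <class CodeUnit>
bool lead_byte_is_negative() {
    const CodeUnit unit = static_cast<CodeUnit>(0xC3);
    return static_cast<int>(unit) < 0;
}
```

lead_byte_is_negative<unsigned char>() is always false, while lead_byte_is_negative<char>() follows the platform's choice of signedness for char, which is why range checks written against plain char are easy to get wrong.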

If the former, is working one's way up to, e.g., typedef std::basic_string<unsigned char> u8string a viable emulation strategy? Are there backport / reference implementations available that one can look into before writing one's own?

I won't comment on whether this approach is a good idea or not, but it has been done before. For example, EASTL has such a typedef. (That project also provides a definition of char8_t if the native type isn't available.)

What's the closest we have in C++17-or-below to marking text as (intended to be) UTF-8 for storage only?

I don't think there is one right answer to this question. I've seen projects use unsigned char or provide a char8_t like type via a class.
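As one illustration of a class-like stand-in, a scoped enumeration with unsigned char as its underlying type gives a distinct, byte-sized type without writing any operators. This is only a sketch of the idea and not how any particular project does it; the names utf8_unit, to_utf8_unit, to_uchar and to_utf8_units are hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for char8_t on pre-C++20 compilers: a scoped enum
// is a distinct type with no implicit conversion to or from char.
enum class utf8_unit : unsigned char {};

static_assert(sizeof(utf8_unit) == sizeof(unsigned char), "byte-sized");

inline utf8_unit to_utf8_unit(unsigned char c) { return static_cast<utf8_unit>(c); }
inline unsigned char to_uchar(utf8_unit u) { return static_cast<unsigned char>(u); }

// Copies bytes that the caller asserts are UTF-8 into storage-only form.
// No validation is performed here.
inline std::vector<utf8_unit> to_utf8_units(const std::string& bytes) {
    std::vector<utf8_unit> out;
    out.reserve(bytes.size());
    for (std::size_t i = 0; i < bytes.size(); ++i)
        out.push_back(to_utf8_unit(static_cast<unsigned char>(bytes[i])));
    return out;
}
```

Because the enum has no arithmetic or stream operators, accidental mixing with char-based text fails to compile, which fits the "for storage only" use case; anything beyond storage would need explicit conversions through to_uchar.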

With regard to your pseudocode, some tweaks to the code in the previously mentioned char8_t-remediation repository to provide unsigned char types instead of char should enable code like the following to work. See the definitions of the _as_char user-defined literals and U8 macro.

typedef std::basic_string<unsigned char> u8string;
u8string u8s(U8("text"));
