Reputation: 1057
I've been reading links such as this question and, of course, this question on preparing for the upcoming "utf8" char type char8_t and its corresponding string type in C++20, and can say, up to a point, that it's about time. Also that it's a mess.
Feel free to correct me where I'm wrong:
- C++, any standards, have no means to specify that the source code has a given text encoding (something like Python's # encoding:... metadata), nor what Standards can it be compiled into (like say #!/bin/env g++ -std=c++14).
- Up until C++11, there was also no way to specify that any given string literal would have a given encoding - the compiler was free to reparse a UTF8 string literal into say UTF16 or even EBCDIC if it so desired.
- C++11 introduces u16"text" and u32"text" and associated char types to produce UTF16 and UTF32-encoded text, but does not provide string or stream facilities to work with them, so they're basically useless.
- C++11 also introduces u8"text" for producing an UTF8-encoded string... but does not even introduce either a proper UTF8 char type or string type (that's what char8_t is intended to be in C++20?), so it's even uselesser than the above.
- Because of all this, when char8_t is finally introduced, it kills lots of code that was intended to be valid and so far some of the remediations sought include disabling char8_t behaviour altogether.
- Even then, there's no readily available tooling (as in: not the same crap tier interface as <random>) to check, transform (within the same string) or convert (copying across string types) text encodings in C++. Even codecvt seems to have been dropped.
Given all of the above, I have some questions regarding why we are in this weird state and whether it'll ever get better. Historically, Unicode support has been one of the lowest points of C++.
Similarly, I am wondering how useful a poor-man's emulation of the whole concept is (disclaimer: I am the maintainer of cxxomfort, so I already backport lots of things. Work needs: the latest MSVC target at the office is MSVC 2012).
- Why did C++ not add char8_t at the proper time when u8"text" was introduced or otherwise delay introduction of u8?
- Alternatively, why wasn't another, non-breaking prefix like c8"text" introduced with char8_t in C++20 instead of introducing a wide-scope breaking change? I thought TPTB hated breaking changes, even more something that literally breaks the simplest possible case: cout<< prefix"hello world".
- Is char8_t intended to functionally be (closer to) an alias of unsigned char or of char?
- If the former, is working up the way to eg.: typedef std::basic_string<unsigned char> u8string a viable emulation strategy? Are there backport / reference implementations available one can look into before writing my own?
- What's the closest we have in C++17-or-below to marking text as (intended to be) UTF-8 for storage only?
Re: char8_t as unsigned char, this is more or less what I'm looking at in terms of pseudocode:
// this is here basically only for type-distinctiveness
class char8_t {
    unsigned char value;
public:
    // deliberately not explicit, so unsigned char converts implicitly
    constexpr char8_t (unsigned char ch = 0x00) noexcept;
    operator unsigned char () const noexcept;
    // implement all operators to mirror operations on unsigned char

    // public adapter jic (a friend declaration has to live inside the class)
    friend unsigned char to_char (char8_t);
};
// note we're *not* using our new char-type here
namespace std {
typedef std::basic_string<unsigned char> u8string;
}
// unsure if these two would actually be needed
// (couldn't make a compelling case so far,
// even testing with Windows's broken conhost)
namespace std {
basic_istream<char8_t> u8cin;
basic_ostream<char8_t> u8cout;
}
// we work up operator<<, operator>> and string conversion from there
// adding utf8-validity checks where needed
std::ostream& operator<< (std::ostream&, std::u8string const&);
std::istream& operator>> (std::istream&, std::u8string&);
// likely a macro; we'll see
#define u8c(ch) static_cast<char8_t>(ch)
// char8_t ch = u8c('x');
// very likely not a macro pre-C++20; can't skip utf-8 validity check on [2]?
u8string u8s (char8_t const* str); // [1], likely trivial
u8string u8s (char const* str); // [2], non-trivial
// C++20 and up
#define u8s(str) u8##str // or something; not sure
// end result:
// no, I can't even think how would one spell this:
u8string text = u8s("H€łlo Ẅørλd");
// this wouldn't work without refactoring u8string into a full specialization,
// to add the required constructor, but doing so is a PITA because
// the basic_string interface is YAIM (yet another infamous mess):
u8string text = u8"H€łlo Ẅørλd";
I've tagged this as C++ in general, but this is more about (the value of) an implementation for Standards pre-C++20. More importantly, I'm not looking for "perfect" solutions or justifications; given the context, poor-man's is more than good enough.
Upvotes: 6
Views: 2044
Reputation: 2241
I'm the author of the P0482 and P1423 char8_t papers.
> Also that it's a mess.
I completely agree. SG16 is working to improve all things Unicode and text related, but we're having to start near ground level, so it is going to take a while.
If you haven't seen it yet, the char8_t-remediation repository provides some utilities for writing code that will work in both C++17 and C++20.
> C++, any standards, have no means to specify that the source code has a given text encoding (something like Python's # encoding:... metadata), nor what Standards can it be compiled into (like say #!/bin/env g++ -std=c++14).
This is correct, but not without precedent. IBM's xlC compiler supports a #pragma filetag directive that behaves similarly to Python's encoding declaration. I started on a paper exploring this space and had hoped to submit it for the Prague meeting, but did not complete it in time. I expect to submit it for the Varna meeting (in June).
> Up until C++11, there was also no way to specify that any given string literal would have a given encoding - the compiler was free to reparse a UTF8 string literal into say UTF16 or even EBCDIC if it so desired.
Correct, and this technically remained true for char16_t and char32_t string literals until C++20 and the adoption of P1041. Note though that there is no reparsing going on. In translation phase 1, the source code contents are converted to the compiler's internal encoding and then, in translation phase 5, character and string literals are converted to the encoding of the appropriate execution character set.
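To make the phase 5 conversion concrete, here is a minimal sketch that dumps the code units the compiler emits for each literal kind. The names `code_units` and `check_literal_encodings` are hypothetical, universal-character-names are used so the result does not depend on the source file's encoding, and the UTF-16/UTF-32 expectations assume a compiler that encodes those literals as such (guaranteed from C++20, and the usual behaviour before that).

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <vector>

// Hypothetical helper: returns the code units of a string literal as
// unsigned integers, leaving out the terminating null.
template <typename CharT, std::size_t N>
std::vector<unsigned long> code_units(const CharT (&lit)[N]) {
    typedef typename std::make_unsigned<CharT>::type unsigned_unit;
    std::vector<unsigned long> out;
    for (std::size_t i = 0; i + 1 < N; ++i) {
        out.push_back(static_cast<unsigned_unit>(lit[i]));
    }
    return out;
}

// Hypothetical self-check: the same code points, encoded three ways.
inline void check_literal_encodings() {
    typedef std::vector<unsigned long> units;
    // U+20AC EURO SIGN
    assert((code_units(u8"\u20AC") == units{0xE2, 0x82, 0xAC})); // UTF-8
    assert((code_units(u"\u20AC") == units{0x20AC}));            // UTF-16
    assert((code_units(U"\u20AC") == units{0x20AC}));            // UTF-32
    // U+1F600 needs a surrogate pair in UTF-16
    assert((code_units(u"\U0001F600") == units{0xD83D, 0xDE00}));
    assert((code_units(U"\U0001F600") == units{0x1F600}));
}
```

Calling `check_literal_encodings()` from a test is enough to confirm what a given toolchain does with these literals.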
> C++11 introduces u16"text" and u32"text" and associated char types to produce UTF16 and UTF32-encoded text, but does not provide string or stream facilities to work with them, so they're basically useless.
Correct. P1629 is one of the more significant changes we're hoping to complete for C++23. The goal is to provide text encoders, decoders, and transcoders that facilitate working with text at the code unit and code point levels. We would also provide support for enumerating grapheme clusters.
> C++11 also introduces u8"text" for producing an UTF8-encoded string... but does not even introduce either a proper UTF8 char type or string type (that's what char8_t is intended to be in C++20?), so it's even uselesser than the above.
Correct. The goal for C++20 was to 1) enable differentiating "text" and u8"text" in the type system, 2) enable separating locale dependent and UTF-8 text (with enforcement from the type system), 3) ensure use of an unsigned type for UTF-8 code units, and 4) avoid the char type aliasing penalty. That was all we had time to get done for C++20 (standardization is not a rapid process).
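As an illustration of goals 1 to 3, the sketch below shows how an overload set can tell the two literal kinds apart once char8_t exists. The names `u8_unit` and `describe` are hypothetical; the feature-test macro `__cpp_char8_t` keeps the code compiling on pre-C++20 compilers, where both literals still have the same type.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Element type of a u8 literal: char before C++20, char8_t from C++20 on.
typedef std::remove_cv<
    std::remove_reference<decltype(u8"x"[0])>::type>::type u8_unit;

// Hypothetical overload set: the type alone tells us what we were given.
inline std::string describe(const char*) {
    return "narrow text, encoding not known from the type";
}
#if defined(__cpp_char8_t)
inline std::string describe(const char8_t*) {
    return "UTF-8 text";
}
// Goal 3: UTF-8 code units are unsigned.
static_assert(std::is_unsigned<char8_t>::value, "char8_t is unsigned");
#endif
```

With char8_t available, `describe(u8"x")` and `describe("x")` pick different overloads; without it, both calls resolve to the `const char*` overload, which is exactly the ambiguity the new type removes.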
> Because of all this, when char8_t is finally introduced, it kills lots of code that was intended to be valid and so far some of the remediations sought include disabling char8_t behaviour altogether.
Correct, char8_t was proposed as a breaking change; something not to be taken lightly. In this case, it was deemed acceptable because 1) code searches found little use of u8 character and string literals, 2) the options for addressing backward compatibility concerns as discussed in P1423 were considered adequate, and 3) a non-breaking proposal would have added long term baggage to the language for little gain.
> Even then, there's no readily available tooling (as in: not the same crap tier interface as <random>) to check, transform (within the same string) or convert (copying across string types) text encodings in C++. Even codecvt seems to have been dropped.
Correct. We'll be working to improve this situation, but it will take time. codecvt has not been dropped (yet); the <codecvt> header and various UTF converters were deprecated in C++17. std::codecvt suffers from performance and usability issues, so is not considered something we can continue to build on. We believe P1629 is a superior direction.
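Until such facilities exist, the "check" part of the question has to be written by hand or taken from a library. The following is only a minimal sketch of a UTF-8 well-formedness check, not anything taken from P1629; the function name `is_valid_utf8` is hypothetical. It rejects truncated sequences, stray continuation bytes, overlong forms, surrogates, and values above U+10FFFF.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: true when the n bytes at p form well-formed UTF-8.
inline bool is_valid_utf8(const unsigned char* p, std::size_t n) {
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = p[i];
        if (lead < 0x80) { ++i; continue; }            // ASCII, one code unit
        std::size_t len;
        unsigned long cp, min;
        if ((lead & 0xE0) == 0xC0)      { len = 2; cp = lead & 0x1F; min = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; min = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; min = 0x10000; }
        else return false;                             // stray continuation or invalid lead
        if (n - i < len) return false;                 // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char cont = p[i + k];
            if ((cont & 0xC0) != 0x80) return false;   // not a continuation byte
            cp = (cp << 6) | (cont & 0x3F);
        }
        if (cp < min) return false;                    // overlong encoding
        if (cp > 0x10FFFF) return false;               // beyond the Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false; // surrogate code point
        i += len;
    }
    return true;
}

// Convenience overload for bytes stored in a std::string.
inline bool is_valid_utf8(const std::string& bytes) {
    return is_valid_utf8(reinterpret_cast<const unsigned char*>(bytes.data()),
                         bytes.size());
}
```

Note that the check works on unsigned char throughout, which sidesteps the signedness of plain char when comparing against values such as 0x80.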
> Why did C++ not add char8_t at the proper time when u8"text" was introduced or otherwise delay introduction of u8?
I asked one of the C++ committee members who was involved in that original effort. He told me that he asked the people working on Unicode at the time if a new type should be added and the response was, "eh, we don't need it".
> Alternatively, why wasn't another, non-breaking prefix like c8"text" introduced with char8_t in C++20 instead of introducing a wide-scope breaking change? I thought TPTB hated breaking changes, even more something that literally breaks the simplest possible case: cout<< prefix"hello world".
A different prefix was considered and at one point I briefly favored that approach. However, as mentioned earlier, that would have left us with two ways of spelling UTF-8 literals and related historical baggage. In the long run, it was felt that a breaking change, so long as we had reasonable means to mitigate the breakage, offered more benefits.
With regard to that simple test case, take a minute to think about what that code should do. Then go read this: "What is the printf() formatting character for char8_t *?".
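For code that simply wants the bytes written out unchanged, one option is an explicit helper that works whether u8 literals are char (before C++20) or char8_t (C++20, where the stream inserters for char8_t are deleted). This is a minimal sketch; the name `write_utf8` is hypothetical, and it deliberately makes no attempt to transcode to the stream's or the terminal's encoding.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical helper: writes null-terminated UTF-8 code units to a narrow
// stream byte for byte. Whether the result displays correctly depends on
// what consumes the stream.
template <typename CharT>
std::ostream& write_utf8(std::ostream& os, const CharT* text) {
    static_assert(sizeof(CharT) == 1, "expects char or char8_t code units");
    return os << reinterpret_cast<const char*>(text);
}
```

Usage would look like `write_utf8(std::cout, u8"hello world") << '\n';`, which makes the "bytes pass through untouched" decision visible at the call site instead of leaving it implicit in `operator<<`.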
> Is char8_t intended to functionally be (closer to) an alias of unsigned char or of char?
char8_t is intentionally and explicitly not an alias (because that has negative performance implications) but is specified to have the same underlying representation as unsigned char. The reason for unsigned char over char is to avoid expressions like u8'\x80' < 0 ever evaluating to true (which may or may not be the case with char today).
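The signedness pitfall can be shown without char8_t at all. In this minimal sketch (all function names are hypothetical), the same "is this a non-ASCII code unit?" test is written once against plain char and once against unsigned char; only the second gives the same answer on every platform.

```cpp
#include <cassert>
#include <limits>

// True when plain char is a signed type on this platform.
inline bool plain_char_is_signed() {
    return std::numeric_limits<char>::is_signed;
}

// Naive test on plain char: never true where char is signed, because a
// byte such as 0xE2 is then stored as a negative value.
inline bool naive_is_non_ascii(char c) {
    return c >= 0x80;
}

// The same test on unsigned char behaves identically everywhere.
inline bool is_non_ascii(unsigned char cu) {
    return cu >= 0x80;
}
```

On a platform where char is signed, `naive_is_non_ascii("\xE2"[0])` is false while `is_non_ascii(0xE2)` is true; an unsigned code unit type removes that class of bug by construction.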
> If the former, is working up the way to eg.: typedef std::basic_string<unsigned char> u8string a viable emulation strategy? Are there backport / reference implementations available one can look into before writing my own?
I won't comment on whether this approach is a good idea or not, but it has been done before. For example, EASTL has such a typedef. (That project also provides a definition of char8_t if the native type isn't available.)
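One practical caveat with that typedef: the standard only specifies std::char_traits for the standard character types, so std::basic_string<unsigned char> relies on the library supplying a generic traits template, which not every implementation does. A minimal sketch of a more portable variant supplies its own traits class. All names here (`u8_compat_traits`, `u8string_compat`, `make_u8`) are hypothetical and are not taken from EASTL or any other project.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <cwchar>   // std::mbstate_t
#include <ios>      // std::streamoff, std::streampos
#include <string>

// Hypothetical traits class for unsigned char code units.
struct u8_compat_traits {
    typedef unsigned char  char_type;
    typedef int            int_type;
    typedef std::streamoff off_type;
    typedef std::streampos pos_type;
    typedef std::mbstate_t state_type;

    static void assign(char_type& dst, const char_type& src) { dst = src; }
    static bool eq(char_type a, char_type b) { return a == b; }
    static bool lt(char_type a, char_type b) { return a < b; }
    static int compare(const char_type* a, const char_type* b, std::size_t n) {
        return n == 0 ? 0 : std::memcmp(a, b, n);
    }
    static std::size_t length(const char_type* s) {
        return std::strlen(reinterpret_cast<const char*>(s));
    }
    static const char_type* find(const char_type* s, std::size_t n, const char_type& c) {
        return n == 0 ? nullptr : static_cast<const char_type*>(std::memchr(s, c, n));
    }
    static char_type* move(char_type* dst, const char_type* src, std::size_t n) {
        return n == 0 ? dst : static_cast<char_type*>(std::memmove(dst, src, n));
    }
    static char_type* copy(char_type* dst, const char_type* src, std::size_t n) {
        return n == 0 ? dst : static_cast<char_type*>(std::memcpy(dst, src, n));
    }
    static char_type* assign(char_type* dst, std::size_t n, char_type c) {
        return n == 0 ? dst : static_cast<char_type*>(std::memset(dst, c, n));
    }
    static int_type to_int_type(char_type c) { return c; }
    static char_type to_char_type(int_type i) { return static_cast<char_type>(i); }
    static bool eq_int_type(int_type a, int_type b) { return a == b; }
    static int_type eof() { return -1; }
    static int_type not_eof(int_type i) { return i == eof() ? 0 : i; }
};

typedef std::basic_string<unsigned char, u8_compat_traits> u8string_compat;

// Hypothetical factory: builds the string from a narrow or u8 literal,
// whichever element type the compiler gives it.
template <typename CharT, std::size_t N>
u8string_compat make_u8(const CharT (&lit)[N]) {
    static_assert(sizeof(CharT) == 1, "expects a narrow or u8 literal");
    return u8string_compat(reinterpret_cast<const unsigned char*>(lit), N - 1);
}
```

The factory function stands in for the converting constructor the question wishes basic_string had, at the cost of a `make_u8(u8"...")` call at each use.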
> What's the closest we have in C++17-or-below to marking text as (intended to be) UTF-8 for storage only?
I don't think there is one right answer to this question. I've seen projects use unsigned char or provide a char8_t-like type via a class.
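A class-based stand-in can be quite small. The sketch below is one possible shape, assuming a compiler with constexpr and noexcept support; the name `u8char` is hypothetical (char8_t itself cannot be declared by user code once it is a keyword), and `overload_picked` exists only to show that the wrapper is a distinct type for overload resolution.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Hypothetical char8_t stand-in: a distinct type with the size and layout
// of unsigned char, implicitly convertible in both directions.
class u8char {
    unsigned char value_;
public:
    constexpr u8char(unsigned char ch = 0) noexcept : value_(ch) {}
    constexpr operator unsigned char() const noexcept { return value_; }
};

static_assert(sizeof(u8char) == 1, "same size as unsigned char");
static_assert(std::is_standard_layout<u8char>::value, "plain layout");
static_assert(std::is_trivially_copyable<u8char>::value, "safe to memcpy");

// Overloads that only exist to demonstrate type distinctiveness.
inline std::string overload_picked(char)   { return "char"; }
inline std::string overload_picked(u8char) { return "u8char"; }
```

Because the conversion to unsigned char is implicit, comparisons and arithmetic fall back to the built-in operators, so `u8char(0xE2) >= 0x80` works without defining any operators on the class. Unlike the real char8_t, though, a class type cannot be the element type of a string literal.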
With regard to your pseudocode, some tweaks to the code in the previously mentioned char8_t-remediation repository to provide unsigned char types instead of char should enable code like the following to work. See the definitions of the _as_char user-defined literals and U8 macro.
typedef std::basic_string<unsigned char> u8string;
u8string u8s(U8("text"));
Upvotes: 11