ABu

Reputation: 12279

Portable C++ way to get the value representation of a u8 string literal

Let's consider:

char const str[] = u8"ñ";
auto const* u8_code_units = reinterpret_cast<unsigned char const*>(str);
// using u8_code_units elements

Is that fully portable and C++ standard compliant? Or is there some clause that says it's undefined behaviour or that it depends on an unspecified value? I know that unsigned char and char shall have the same alignment requirements and reinterpret_cast<T*>(v) equals in that case to static_cast<T*>(static_cast<void*>(v)), so I think it is completely safe and portable to use, but I'm not sure.

Upvotes: 0

Views: 165

Answers (1)

Nicol Bolas

Reputation: 473916

Is that fully portable and C++ standard compliant?

Kinda, but not for the reason you think.

See, you have to actually save that file to disk in some format. Which means your compiler has to be able to read that same format. And what text formats a compiler supports is implementation-defined.

However, if your compiler supports the format you saved it in, and that format can save Unicode-encoded characters, then your compiler will do the right thing here.
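
If you'd rather not depend on the source file's encoding at all, one option is to spell the character as a universal-character-name, which keeps the file pure ASCII. A minimal sketch, assuming CHAR_BIT is 8; the names n_tilde and n_tilde_unit are mine, not from the question, and auto const& is used because a u8 literal's element type is char before C++20 and char8_t from C++20 on:

```cpp
#include <cassert>
#include <cstddef>

// U+00F1 is ñ. Written as an escape, the literal no longer depends on which
// source file encodings the compiler accepts.
static auto const& n_tilde = u8"\u00F1";

// Two UTF-8 code units plus the terminating null.
static_assert(sizeof(n_tilde) == 3, "U+00F1 encodes as two UTF-8 code units");

// Hypothetical helper: returns code unit `i` as an unsigned value.
inline unsigned char n_tilde_unit(std::size_t i) {
    assert(i < sizeof(n_tilde));
    return static_cast<unsigned char>(n_tilde[i]);
}
```

With this, n_tilde_unit(0) is 0xC3 and n_tilde_unit(1) is 0xB1, which is the UTF-8 encoding of U+00F1.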

Even the reinterpret_cast is fine, because the standard allows a char array to be accessed through unsigned char, even if the platform's char is signed. And the standard explicitly requires that, when reading a UTF-8 formatted char array through an unsigned char, you will get the bits you expect from the UTF-8 formatting.
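
As a rough illustration of that access, here is a sketch of a helper that reads a u8 literal through unsigned char const*. The name code_unit is hypothetical, and the template parameter lets it accept both the char-based literal (before C++20) and the char8_t-based one (C++20 and later). Note that the cast keeps const, since reinterpret_cast cannot cast constness away:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: returns code unit `i` of a u8 literal by viewing the
// array through unsigned char const*.
template <class CharT, std::size_t N>
unsigned char code_unit(CharT const (&lit)[N], std::size_t i) {
    static_assert(sizeof(CharT) == 1, "expects a u8 string literal");
    assert(i < N);
    // Keep const in the target type: reinterpret_cast cannot remove it.
    auto const* units = reinterpret_cast<unsigned char const*>(lit);
    return units[i];
}
```

For example, code_unit(u8"\u00F1", 0) yields 0xC3 and code_unit(u8"\u00F1", 1) yields 0xB1, regardless of whether plain char is signed.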

Note however:

I know that unsigned char and char shall have the same alignment requirements and reinterpret_cast<T*>(v) equals in that case to static_cast<T*>(static_cast<void*>(v)),

That would not be enough to protect you. It works because the standard explicitly says that it works in this particular case, not because of alignment requirements and such. char and unsigned char have exceptions to the rules on aliasing to allow this; alignment has nothing to do with it.
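
To make the distinction concrete, here is a small sketch (byte_of is a made-up name) that reads the bytes of an arbitrary object through unsigned char, which the aliasing exception permits for any type. The commented-out lines show a cast between two types that commonly share size and alignment and is still undefined behaviour, which is why alignment alone proves nothing:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: reads byte `i` of the object representation of `obj`.
// Allowed for any T because unsigned char may alias any object.
template <class T>
unsigned char byte_of(T const& obj, std::size_t i) {
    assert(i < sizeof(T));
    return reinterpret_cast<unsigned char const*>(&obj)[i];
}

// By contrast, this violates the aliasing rules even on platforms where float
// and std::uint32_t have the same size and alignment:
//   float f = 1.0f;
//   auto bits = *reinterpret_cast<std::uint32_t const*>(&f);  // undefined behaviour
```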

Upvotes: 2
