Chris Jefferson

Reputation: 7157

Does a `std::u8string` have to be UTF-8?

C++20 added char8_t, which is (I believe) designed to help support UTF-8 better.

String literals of the form u8"abc" are required by the standard to be UTF-8 encoded, and they have type const char8_t[N]. Such literals can also be used to construct a std::u8string.

However, I can find nothing in the C++ standard which suggests that a std::u8string either must, or even should, contain a UTF-8 string. Is there in practice any difference between a std::string and std::u8string in terms of UTF-8 support?

Upvotes: 2

Views: 391

Answers (1)

Chronial

Reputation: 70693

No, C++ does not require you to store valid UTF-8 in a std::u8string. std::u8string is simply std::basic_string<char8_t>: it is a distinct type from std::string, but it offers the same operations and performs no encoding validation.

But in practice, you can expect functions that take a u8string argument to assume the string is valid UTF-8. Even if they tolerate invalid UTF-8, they will never expect your string to be Latin-1 encoded. The same cannot be said for std::string.

Upvotes: 2
