Reputation: 31
Let's say I want to store a single Unicode character (not a string of them, as in std::string) in C++, how would I do that? char8_t
was introduced in C++20, but it seems to be essentially an unsigned char
, only storing up to 1 byte of information. Some characters (especially more exotic ones like emoji) can take up to 4 bytes at once.
Example of code that doesn't work:
char8_t smth = "😀";
Interestingly, this WILL work, although sizeof()
says it's 8 bytes big, which I somehow doubt:
const char* smth = "😀";
Upvotes: 3
Views: 2956
Reputation: 627
Unicode vs UTF-8 vs UTF-32 vs char8_t vs char32_t
Unicode
is a standard that maps characters to numeric values called code points
. A code point is an unsigned integer that fits in 32 bits (Unicode actually only goes up to 0x10FFFF). By abuse of language we also say "Unicode" to talk about a code point. For instance, the Unicode code point of 😀 is 0x1F600
.
UTF-32
is a trivial encoding of Unicode code points into 4 bytes (or 32 bits). It is trivial because you can just store the code point itself, which is a 32-bit unsigned integer.
UTF-8
is an encoding format of Unicode code points able to store them in 1 to 4 blocks of 8 bits of data. This is possible because Unicode code points don't use all 32 bits: the most frequently used characters (roughly the ASCII range) fit in 1 byte (or 8 bits), and less frequently used ones take 2 to 4 bytes.
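A quick way to see the variable length in action is sizeof on u8 string literals, since each one is UTF-8 encoded and includes a trailing 0x00 byte:
static_assert(sizeof(u8"a") == 1 + 1); // ASCII character: 1 byte
static_assert(sizeof(u8"Γ") == 2 + 1); // U+0393: 2 bytes
static_assert(sizeof(u8"€") == 3 + 1); // U+20AC: 3 bytes
static_assert(sizeof(u8"😀") == 4 + 1); // U+1F600: 4 bytes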
char8_t
is roughly an unsigned integer of 8 bits. I say "roughly" for (at least) 2 reasons: first, the C++ standard requires it to have the same size and representation as unsigned char, which is at least 8 bits but may be more if the compiler/system decides so; and second, it is a distinct type of its own and is not exactly the same as std::uint8_t (though converting between the two is trivial).
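A minimal sketch of the "distinct type" point (assuming <type_traits> and <cstdint> are included; this compiles in C++20):
static_assert(sizeof(char8_t) == sizeof(unsigned char)); // same size and representation...
static_assert(!std::is_same_v<char8_t, unsigned char>); // ...but a distinct type
static_assert(!std::is_same_v<char8_t, std::uint8_t>); // and not std::uint8_t either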
char32_t
is similar to a char8_t
except it allows for the use of 32 bits (so it's roughly comparable to a std::uint32_t
), which is convenient because you can use it to store exactly one Unicode code point.
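In other words, with char32_t the value stored is the code point itself:
char32_t c = U'😀'; // one Unicode character...
static_assert(U'😀' == char32_t{ 0x1F600 }); // ...stored directly as its code point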
The case of char(8_t) const*
In C++ you should be careful when using C-strings (char(8_t) const*
). They do not behave like string objects but like pointers, so querying their size returns the size of the pointer (8 on 64-bit processors). It can look even more confusing with the following code:
char8_t const* str = u8"Hello";
sizeof(str); // == 8
sizeof(u8"Hello"); // == 6 (5 letters + trailing 0x00)
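If you want the size of the actual data, keep the literal in an array, or use a view that carries its own length (assuming <string_view> is included):
char8_t const array[] = u8"Hello"; // an array, not a pointer
sizeof(array); // == 6, same as the literal
std::u8string_view view{ array }; // .size() == 5, the trailing 0x00 isn't counted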
Using the appropriate string literal
Be careful when using char
(or char const*
or std::string
). It is not made to store UTF-8
encoded strings but characters of the narrow execution character set (typically some Extended ASCII code page). Thus your compiler will not know what you are trying to write and will likely not do what you expect.
char c0 = '😀'; // = '?' on Visual Studio (with 3 warnings)
char8_t c1 = u8'😀'; // Compilation error: trying to store 4 char8_t in 1
char32_t c2 = U'😀'; // = 😀 (or 128512, i.e. 0x1F600)
char const* s0 = "😀"; // = "??" on Visual Studio (with 1 warning)
char8_t const* s1 = u8"😀"; // = "😀" stored on 4 bytes (0xf0, 0x9f, 0x98, 0x80), displayed as "ðŸ˜€" if those bytes are misread as Extended ASCII
char32_t const* s2 = U"😀"; // = "😀" stored as the 4-byte unsigned integer 128512
sizeof("😀"); // = 3 on Visual Studio: 😀 becomes "??" (2 bytes, likely one '?' per UTF-16 code unit of the emoji) + 1 byte for 0x00
sizeof(u8"😀"); // = 5: 4 bytes for 😀 + 1 byte for 0x00
sizeof(U"😀"); // = 8: 4 bytes for 😀 + 4 bytes for 0x00
Storing one Unicode character
As stated by Igor, storing 1 Unicode character can be done through the use of char32_t
. However, if you want to store the code point itself (the integer) you can use a std::uint32_t
. These 2 representations are different, both for the compiler and semantically, so be aware! Most of the time, using char32_t will be more explicit and less error-prone.
char32_t c = U'😀';
std::uint32_t u = 0x1F600u; // it's funny because 'u' stands for unsigned here..
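To make the "different for the compiler" part concrete (assuming <type_traits> and <cstdint> are included):
static_assert(!std::is_same_v<char32_t, std::uint32_t>); // distinct types, even if same size
std::uint32_t v = static_cast<std::uint32_t>(U'😀'); // converting is trivial when you need the integer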
Storing a String of Unicode characters
However, if you want to store a string of Unicode characters you have multiple options. What you want to know first is what the constraints of your program are, which other systems it interacts with, etc.
If you need to constantly add/remove characters or inspect code points (for instance to draw each character on screen from a font), and, quite importantly, if you don't have strong memory constraints and don't need to interface with an (older) library that uses plain strings to store UTF-8
encoded characters, you can go with the UTF-32 representation through the use of char32_t
:
std::size_t size = sizeof(U"😀Γ"); // = 12 -> 4 bytes for each character, including the trailing 0x00
char32_t const* cString = U"😀Γ"; // sizeof(...) = 8 -> the size of a pointer
std::u32string string{ U"😀Γ" }; // .size() = 2
std::u32string_view stringView{ U"😀Γ" }; // .size() = 2
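The convenience shows as soon as you manipulate individual characters; a small sketch (draw_glyph is a hypothetical function standing for whatever your rendering code does):
std::u32string text{ U"😀Γ" };
text += U'!'; // appending 1 character appends exactly 1 element
for (char32_t code : text)
{
    draw_glyph(code); // hypothetical: each element is exactly 1 code point to look up
}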
If you are limited by memory and can't afford 32 bits of storage for each character (knowing that in most cases text is ASCII
characters that UTF-8
can represent in only 8 bits), or if you need to interface with libraries that (for instance) use char const*
/std::string
to store UTF-8
encoded characters, you can decide to store your string encoded in UTF-8 through the use of char8_t
:
std::size_t size = sizeof(u8"😀Γ");
// = 7 -> 4 bytes for the emoji (emoji are pretty uncommon, so UTF-8 encodes them on 4 bytes)
// + 2 bytes for the "Γ" (less uncommon, but not a -very common- ASCII character)
// + 1 byte for the trailing 0x00
char8_t const* cString = u8"😀Γ"; // sizeof(...) = 8 -> the size of a pointer
std::u8string string{ u8"😀Γ" }; // .size() = 6 (string's size method doesn't count the 0x00)
std::u8string_view stringView{ u8"😀Γ" }; // .size() = 6
The trick with the use of char8_t
is that technically your computer doesn't know that the string is encoded in UTF-8
(well, your compiler knows and encodes "😀Γ" for you); it only knows that you are storing 8-bit long things representing characters, hence why it doesn't return 2 when you ask for the size of these strings. If you need to know how many Unicode characters that represents (or how many characters you would have to draw on screen), you need to decode the encoding. Some fancy library probably exists that will do it for you, but here is what I personally use (I wrote it based on the UTF-8 specification):
// How many char8_t of this string you need to read to get 1 Unicode character. The
// trick is that this can be determined from the first char8_t alone, because of how
// UTF-8 encoding works. Note that it assumes a non-empty string and doesn't check
// that the following bytes are valid continuation bytes.
constexpr std::size_t code_size(std::u8string_view a_string) noexcept
{
    auto const h0 = a_string[0] & 0b11110000;
    return h0 < 0b10000000 ? 1 : (h0 < 0b11100000 ? 2 : (h0 < 0b11110000 ? 3 : 4));
}
// How many char8_t you need to add to a string to encode this Unicode character in UTF-8.
constexpr std::size_t code_size(char32_t const a_code) noexcept
{
    return a_code <= 0x007f ? 1 : (a_code <= 0x07ff ? 2 : (a_code <= 0xffff ? 3 : 4));
}
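Both overloads are constexpr, so they can be checked at compile time; for instance:
static_assert(code_size(u8"a") == 1); // only the lead byte is inspected
static_assert(code_size(u8"😀") == 4);
static_assert(code_size(U'Γ') == 2); // U+0393 needs 2 bytes in UTF-8
static_assert(code_size(U'😀') == 4); // U+1F600 needs 4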
// How many Unicode characters are stored in this UTF-8 encoded string.
constexpr std::size_t string_size(std::u8string_view a_string) noexcept
{
    std::size_t size = 0;
    while (!a_string.empty())
    {
        auto const codeSize = code_size(a_string);
        if (codeSize > a_string.size())
        {
            return std::size_t(-1); // Error: this is not a valid UTF-8 encoded string.
        }
        ++size; // 1 more character...
        a_string = a_string.substr(codeSize); // ...skipping the bytes that encode it.
    }
    return size;
}
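For instance, with the functions above:
static_assert(string_size(u8"Hello") == 5);
static_assert(string_size(u8"😀Γ") == 2); // 6 bytes, but only 2 characters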
// Append the UTF-8 encoding of a code point to a u8string.
template<typename TAllocator>
constexpr std::size_t write(
    char32_t a_code,
    std::basic_string<char8_t, std::char_traits<char8_t>, TAllocator>& a_output) noexcept
{
    if (a_code <= 0x007f)
    {
        a_output += static_cast<char8_t>(a_code);
        return 1;
    }
    else if (a_code <= 0x07ff)
    {
        a_output += static_cast<char8_t>(0b11000000 | ((a_code >> 6) & 0b00011111));
        a_output += static_cast<char8_t>(0b10000000 | (a_code & 0b00111111));
        return 2;
    }
    else if (a_code <= 0xffff)
    {
        a_output += static_cast<char8_t>(0b11100000 | ((a_code >> 12) & 0b00001111));
        a_output += static_cast<char8_t>(0b10000000 | ((a_code >> 6) & 0b00111111));
        a_output += static_cast<char8_t>(0b10000000 | (a_code & 0b00111111));
        return 3;
    }
    else
    {
        a_output += static_cast<char8_t>(0b11110000 | ((a_code >> 18) & 0b00000111));
        a_output += static_cast<char8_t>(0b10000000 | ((a_code >> 12) & 0b00111111));
        a_output += static_cast<char8_t>(0b10000000 | ((a_code >> 6) & 0b00111111));
        a_output += static_cast<char8_t>(0b10000000 | (a_code & 0b00111111));
        return 4;
    }
}
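Usage is straightforward; this rebuilds, code point by code point, the same bytes as u8"😀Γ":
std::u8string out;
write(U'😀', out); // appends 0xf0, 0x9f, 0x98, 0x80
write(U'Γ', out); // appends 0xce, 0x93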
// Read a Unicode character from a UTF-8 encoded string view, effectively decreasing its size.
constexpr char32_t read(std::u8string_view& a_string)
{
    if (a_string.empty())
    {
        return 0x0000; // Null character
    }
    auto const codeSize = code_size(a_string);
    if (codeSize > a_string.size())
    {
        return 0xffff; // Invalid unicode
    }
    // Mask for the payload bits of the lead byte (7, 5, 4 or 3 bits).
    char8_t const mask0 = codeSize < 2 ?
        0b1111111 : (codeSize < 3 ? 0b11111 : (codeSize < 4 ? 0b1111 : 0b111));
    char32_t unicode = mask0 & a_string[0];
    a_string = a_string.substr(1);
    constexpr char8_t mask = 0b00111111; // Each continuation byte carries 6 bits.
    for (auto i = 1u; i < codeSize; ++i)
    {
        if ((a_string[0] & ~mask) != 0b10000000)
        {
            return 0xffff; // Invalid unicode
        }
        unicode = (unicode << 6) | (mask & a_string[0]);
        a_string = a_string.substr(1);
    }
    return unicode;
}
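And in the decoding direction, read consumes the view as it goes:
std::u8string_view view{ u8"😀Γ" };
char32_t const first = read(view); // == U'😀' (0x1F600), view now starts at "Γ"
char32_t const second = read(view); // == U'Γ' (0x0393), view is now empty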
Upvotes: 4