KiYugadgeter

Reputation: 4014

How to get correct length of std::u8string in C++?

How do I get the correct length of a std::u8string in C++20? I have tried the following code, but it prints an unexpected value for the length, which seems to be the number of code units rather than characters.

How can I get the value I expected, 7, which is the number of characters?

#include <iostream>
#include <string>

int main() {
    const char8_t* s = u8"Hello😃😃";
    auto st = std::u8string(s);
    std::cout << st.size() << std::endl;
}

Upvotes: 2

Views: 2992

Answers (3)

Tom Honermann

Reputation: 2231

Other answers have already suggested ways to compute the number of code points if that is really what you need for your use case. I'm adding this answer to make the point that code point length is probably not what you want.

And actually, I'm not going to make the point myself. Instead, I'm just going to provide a link to an excellent blog post that explains the issues so that you can evaluate what information you actually need.

https://hsivonen.fi/string-length

Upvotes: 2

Richard Hodges

Reputation: 69892

A standard C++ answer is to transform the string from UTF-8 to UTF-32 and then check the size.

Alarmingly, std::wstring_convert is deprecated as of C++17. I have no idea what the replacement will be (a hand-rolled alternative is sketched after the example below).

#include <string>
#include <iostream>
#include <cstdlib>
#include <locale>
#include <codecvt>

auto convert(std::u8string input) -> std::u32string
{
    auto first = reinterpret_cast<const char*>(input.data());
    auto last = first + input.size();

    auto result = std::u32string();

    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> ucs4conv;
    try
    {
        result = ucs4conv.from_bytes(first, last);
    }
    catch(const std::range_error& e) {
        last = first + ucs4conv.converted();
        std::clog << "UCS4 failed after consuming " << std::dec << std::distance(first, last) <<" characters:\n";
        result = ucs4conv.from_bytes(first, last);
    }

    return result;
}

int main() {
    const char8_t* s = u8"Hello😃😃";
    auto st = std::u8string(s);
    std::cout << "bytes      : " << st.size() << std::endl;

    auto ws = convert(st);
    std::cout << "wide chars : " << ws.size() << std::endl;
}

expected output:

bytes      : 13
wide chars : 7

https://godbolt.org/z/Z0a6bb
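
Since std::wstring_convert is deprecated, one deprecation-free alternative is to decode the UTF-8 by hand into a std::u32string. The following is only a minimal sketch (the name decode_utf8 is made up for illustration, and it does not validate overlong forms or surrogate ranges); it could stand in for convert() in the example above:

#include <stdexcept>
#include <string>

// Minimal UTF-8 -> UTF-32 decoder (sketch only; no validation of
// overlong encodings or surrogate code points).
std::u32string decode_utf8(const std::u8string& in)
{
    std::u32string out;
    for (std::size_t i = 0; i < in.size();)
    {
        unsigned char b = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t len = 0;
        if (b < 0x80)                { cp = b;        len = 1; } // 0xxxxxxx
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; len = 2; } // 110xxxxx
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; len = 3; } // 1110xxxx
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; len = 4; } // 11110xxx
        else throw std::range_error("invalid UTF-8 lead byte");

        if (i + len > in.size())
            throw std::range_error("truncated UTF-8 sequence");
        for (std::size_t j = 1; j < len; ++j)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + j]) & 0x3F);

        out.push_back(cp);
        i += len;
    }
    return out;
}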

Upvotes: 2

Fire Lancer

Reputation: 30145

A u8string is effectively a sequence of bytes as far as most C++ functions are concerned. As such, size() gives you 13 (48 65 6c 6c 6f f0 9f 98 83 f0 9f 98 83), the "😃" ("SMILING FACE WITH OPEN MOUTH", U+1F603) being encoded as 4 elements: f0 9f 98 83. You will see this with [i], substr, etc. as well.

Knowing that it is UTF-8, you can count the number of Unicode code points. Alternatively, you could use a u32string, whose elements are code points. I don't believe C++ has a function to do this directly on a u8string out of the box:

size_t count_codepoints(const std::u8string &str)
{
    size_t count = 0;
    for (auto &c : str)
        if ((c & 0b1100'0000) != 0b1000'0000) // Not a trailing byte
            ++count;
    return count;
}
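
For example, applied to the string from the question this should report 7 (a minimal usage sketch, assuming count_codepoints above is in scope):

#include <iostream>
#include <string>

int main() {
    std::u8string st = u8"Hello😃😃";
    std::cout << count_codepoints(st) << std::endl; // expected: 7
}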

However, this is still maybe not what people think of as "number of characters". This is because multiple code points might be used to represent a single visible character, the "combining characters". Some of these also have "precomposed" forms, and the order of the combining code points can vary, leading to the "normal forms" and issues with comparing Unicode strings. For example "Á" might be "LATIN CAPITAL LETTER A WITH ACUTE" (U+00C1), which is UTF-8 C3 81, or it might be a normal "A" with a "COMBINING ACUTE ACCENT" (U+0301), which is two code points and 3 UTF-8 bytes: 41 CC 81.
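
To make that concrete, here is a small sketch reusing count_codepoints from above (the variable names are just for illustration); both strings display as the same visible character but differ in byte and code point counts:

std::u8string precomposed = u8"\u00C1";   // U+00C1, encoded as 2 UTF-8 bytes: c3 81
std::u8string decomposed  = u8"A\u0301";  // U+0041 U+0301, encoded as 3 bytes: 41 cc 81
// precomposed.size() == 2, count_codepoints(precomposed) == 1
// decomposed.size()  == 3, count_codepoints(decomposed)  == 2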

There are tables for each Unicode version from unicode.org that let you properly handle and convert the combining characters (and things like upper/lower case conversion), but they are pretty extensive and you would need to write some code to handle them. Third-party libraries (I think Linux mostly uses ICU) or OS functions (Windows has a bunch of APIs) also provide various utilities.
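
For instance, counting user-perceived characters (grapheme clusters) with ICU might look like this. This is only a sketch, assuming ICU is installed and linked (e.g. -licuuc), and that the source file is saved as UTF-8 so the narrow literal holds UTF-8 bytes:

#include <unicode/brkiter.h>
#include <unicode/locid.h>
#include <unicode/unistr.h>
#include <iostream>
#include <memory>

int main() {
    // Decode the UTF-8 bytes into ICU's internal UTF-16 representation.
    icu::UnicodeString text = icu::UnicodeString::fromUTF8("Hello😃😃");

    UErrorCode status = U_ZERO_ERROR;
    std::unique_ptr<icu::BreakIterator> it(
        icu::BreakIterator::createCharacterInstance(icu::Locale::getDefault(), status));
    if (U_FAILURE(status)) return 1;

    // Walk the grapheme cluster boundaries and count them.
    it->setText(text);
    int count = 0;
    while (it->next() != icu::BreakIterator::DONE)
        ++count;

    std::cout << count << std::endl; // expected: 7
}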

It's worth noting you can run into these issues in many other cases/languages, not just C++. E.g. JavaScript, Java and .NET, along with the Windows C/C++ API (essentially wchar_t on Windows), use UTF-16 strings, which need "surrogate pairs" for some code points, and many functions actually count UTF-16 elements, not code points.
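
The same effect is easy to see in C++ with a UTF-16 string (a small sketch; each emoji takes a surrogate pair, i.e. two UTF-16 elements):

#include <iostream>
#include <string>

int main() {
    std::u16string s = u"Hello😃😃";
    std::cout << s.size() << std::endl; // 9: 5 BMP characters + 2 surrogate pairs
}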

Upvotes: 7
