Haroon

Reputation: 695

How to get the accurate length of a std::string?

I am trimming a long std::string to fit it into a text container, using this code:

std::string AppDelegate::getTrimmedStringWithRange(std::string text, int range)
{
    if (text.length() > range)
    {
        std::string str(text,0,range-3);
        return str.append("...");
    }
    return text;
}

But for other languages, such as Hindi ("हिन्दी"), the length of the std::string is wrong.
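
For example (assuming the string literal and source file are UTF-8 encoded, as in my project), the reported length is much larger than the number of characters I see:

#include <iostream>
#include <string>

int main()
{
    std::string text = "हिन्दी";          // 6 Unicode code points
    std::cout << text.length() << '\n';  // I expect 6, but this prints 18
}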

My question is: how can I retrieve the accurate length of a std::string in all such cases?

Thanks

Upvotes: 3

Views: 6507

Answers (3)

Christophe

Reputation: 73607

As explained in the comments, length() returns the number of bytes of your string, which is encoded in UTF-8. In this multibyte encoding, non-ASCII characters are encoded on 2 to 4 bytes, so your UTF-8 string length will appear larger than the real number of Unicode code points.

Solution 1

If you have many long strings, you can keep them in UTF-8. The UTF-8 encoding makes it relatively easy to find the additional bytes of multibyte characters: they all start with 10xxxxxx in binary. So count the number of such continuation bytes and subtract that count from the string length:

cout << "Bytes: " << s.length() << endl;
cout << "Unicode length " << (s.length() - count_if(s.begin(), s.end(), [](char c)->bool { return (c & 0xC0) == 0x80; })) << endl;

Solution 2

If more processing is needed than just counting the length, you could consider using std::wstring_convert::from_bytes() from the standard library to convert your string into a std::wstring. The length of the wstring should be what you expect.

wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> cv;
wstring w = cv.from_bytes(s);
cout << "Unicode length " << w.length() << endl;

Attention: on Linux, wstring is based on 32-bit wchar_t, and one such wide char can hold any character of the Unicode character set, so this is perfect. On Windows, however, wchar_t is only 16 bits, so some characters might still require a multi-word (surrogate pair) encoding. Fortunately, all the Hindi characters are in the range U+0000 to U+D7FF, which can be encoded in a single 16-bit word, so it should be OK there as well.

Upvotes: 3

Ferruccio

Reputation: 100758

Assuming you're using UTF-8, you can convert your string to a simple (hah!) Unicode string and count the characters. I grabbed this example from rosettacode.

#include <iostream>
#include <string>
#include <locale>
#include <codecvt>

int main()
{
    std::string utf8 = "\x7a\xc3\x9f\xe6\xb0\xb4\xf0\x9d\x84\x8b"; // U+007a, U+00df, U+6c34, U+1d10b
    std::cout << "Byte length: " << utf8.size() << '\n';
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    std::cout << "Character length: " << conv.from_bytes(utf8).size() << '\n';
}

Upvotes: 9

Lightness Races in Orbit

Reputation: 385385

The length of std::string is not "wrong"; you've simply misunderstood what it means. A std::string stores bytes, not "characters" in your chosen encoding. It gleefully has no knowledge of that layer. As such, the length of std::string is the number of bytes it contains.

To count such "characters", you will need a library that supports analysis of your chosen encoding, whatever that is.

Only if your chosen encoding uses a single byte per character (plain ASCII, for example) can you just count the bytes and be done with it.
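
For illustration only, one such library is ICU; a small sketch (assuming ICU is installed and the program is linked against it, e.g. with -licuuc) that counts Unicode code points rather than bytes:

#include <iostream>
#include <string>
#include <unicode/unistr.h>   // ICU

int main()
{
    std::string utf8 = "हिन्दी";
    icu::UnicodeString u = icu::UnicodeString::fromUTF8(utf8);
    std::cout << "Bytes: " << utf8.size() << '\n';              // prints 18
    std::cout << "Code points: " << u.countChar32() << '\n';    // prints 6
}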

Upvotes: 7
