Reputation: 695
I am trimming a long std::string
to fit it in a text container using this code.
std::string AppDelegate::getTrimmedStringWithRange(std::string text, int range)
{
    if (text.length() > range)
    {
        std::string str(text, 0, range - 3);
        return str.append("...");
    }
    return text;
}
But for text in other languages, such as the Hindi word "हिन्दी",
the length of the std::string
is wrong.
My question is: how can I retrieve the accurate length of the std::string in all cases?
Thanks
Upvotes: 3
Views: 6507
Reputation: 73607
As explained in the comments, length() returns the number of bytes of your string, which is encoded in UTF-8. In this multi-byte encoding, non-ASCII characters are encoded on 2 to 4 bytes, so the UTF-8 string length appears longer than the real number of Unicode characters.
Solution 1
If you have many long strings, you can keep them in UTF-8. The UTF-8 encoding makes it relatively easy to spot the additional bytes of multi-byte characters: these continuation bytes all start with 10xxxxxx in binary. So count the number of such continuation bytes and subtract this from the string length:
cout << "Bytes: " << s.length() << endl;
cout << "Unicode length " << (s.length() - count_if(s.begin(), s.end(), [](char c)->bool { return (c & 0xC0) == 0x80; })) << endl;
Solution 2
If more processing is needed than just counting the length, you could think of using wstring_convert::from_bytes() from the standard library to convert your string into a wstring. The length of the wstring should be what you expect.
wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> cv;
wstring w = cv.from_bytes(s);
cout << "Unicode length " << w.length() << endl;
Attention: wstring on Linux is based on a 32-bit wchar_t, and one such wide character can hold any code point of the Unicode character set, so this is perfect. On Windows, however, wchar_t is only 16 bits, so some characters might still require a multi-word encoding (surrogate pairs). Fortunately, all the Hindi characters are in the range U+0000 to U+D7FF, which can be encoded in one 16-bit word, so it should be OK there as well.
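Putting Solution 2 together as a small self-contained sketch (assuming the source file and the string literal are UTF-8 encoded; note that wstring_convert and codecvt_utf8 are deprecated since C++17, though still widely available):
#include <codecvt>
#include <iostream>
#include <locale>
#include <string>

int main()
{
    std::string s = "हिन्दी";                                  // UTF-8 bytes
    std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> cv;
    std::wstring w = cv.from_bytes(s);                        // one wchar_t per code point here
    std::cout << "Bytes: " << s.length() << '\n'
              << "Unicode length: " << w.length() << '\n';
}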
Upvotes: 3
Reputation: 100758
Assuming you're using UTF-8, you can convert your string to simple (hah!) Unicode and count the characters. I grabbed this example from Rosetta Code.
#include <iostream>
#include <codecvt>
#include <locale>   // std::wstring_convert
#include <string>

int main()
{
    std::string utf8 = "\x7a\xc3\x9f\xe6\xb0\xb4\xf0\x9d\x84\x8b"; // U+007a, U+00df, U+6c34, U+1d10b
    std::cout << "Byte length: " << utf8.size() << '\n';
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    std::cout << "Character length: " << conv.from_bytes(utf8).size() << '\n';
}
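For this input the program prints a byte length of 10 and a character length of 4 (z, ß, 水, and U+1D10B).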
Upvotes: 9
Reputation: 385385
The length of std::string is not "wrong"; you've simply misunderstood what it means. A std::string stores bytes, not "characters" in your chosen encoding. It gleefully has no knowledge of that layer. As such, the length of a std::string is the number of bytes it contains.
To count such "characters", you will need a library that supports analysis of your chosen encoding, whatever that is.
Only if your text is limited to a single-byte encoding such as plain ASCII can you just count the bytes and be done with it.
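For example, ICU is one such library. A minimal sketch, assuming ICU4C is installed and linked (e.g. with -licuuc); the string and program structure here are only illustrative:
#include <iostream>
#include <string>
#include <unicode/unistr.h>   // icu::UnicodeString from ICU4C

int main()
{
    std::string utf8 = "हिन्दी";                      // UTF-8 encoded bytes
    // Interpret the bytes as UTF-8 and count Unicode code points.
    icu::UnicodeString us = icu::UnicodeString::fromUTF8(utf8);
    std::cout << "Bytes: " << utf8.size() << '\n'
              << "Code points: " << us.countChar32() << '\n';
}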
Upvotes: 7