Reputation: 9527
I have started using the ICU library in C++.
UnicodeString ucs = UnicodeString::fromUTF8(StringPiece(u8"\U0001F674"));
ucs = ucs.unescape();
size_t len = ucs.length();
However, len = 2. Why? I have added only one 4-byte character (https://unicode-table.com/en/1F674/). Is there a way to get the correct length?
I expect the length to be 1, since there is only one code point. If I use
ucs = UnicodeString::fromUTF8(StringPiece(u8"\u06b5"));
ucs = ucs.unescape();
len = ucs.length();
I get the correct len = 1.
Upvotes: 0
Views: 1068
Reputation: 6414
To answer the original question, in order to get the number of code points in a UnicodeString, use UnicodeString::countChar32.
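To illustrate what counting code points (rather than code units) means, here is a minimal standard-library sketch. It does not use ICU: `countCodePoints` is a hypothetical helper written for this example, and it assumes the same convention of treating an unpaired surrogate as one code point. With ICU itself, the call would simply be `ucs.countChar32()`.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of ICU): counts code points in a UTF-16
// string by treating each lead+trail surrogate pair as a single code point.
std::size_t countCodePoints(const std::u16string& s) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t unit = s[i];
        const bool isLead = unit >= 0xD800 && unit <= 0xDBFF;
        // If a lead surrogate is followed by a trail surrogate,
        // skip the trail so the pair is counted once.
        if (isLead && i + 1 < s.size() &&
            s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            ++i;
        }
        ++count;
    }
    return count;
}
```

For `u"\U0001F674"`, `size()` reports 2 code units while `countCodePoints` reports 1 code point, which mirrors the difference between `length()` and `countChar32()`.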
-- Shane (from the ICU team)
Upvotes: 1
Reputation: 15164
UnicodeString uses UTF-16, not UTF-8, and length() returns the number of UTF-16 code units, not the number of code points.
In UTF-16, code point U+1F674 requires two 2-byte code units: 0xD83D 0xDE74. Code point U+06B5 requires only one 2-byte code unit: 0x06B5.
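The code unit values above can be checked with a small sketch of the standard UTF-16 encoding rule. This is plain C++ without ICU; `toUtf16` is a hypothetical name chosen for this example, and it assumes the input is a valid Unicode scalar value (at most U+10FFFF and not itself a surrogate).

```cpp
#include <string>

// Hypothetical helper: encodes one code point as UTF-16 code units.
// Assumes cp is a valid scalar value (<= 0x10FFFF, not a surrogate).
std::u16string toUtf16(char32_t cp) {
    std::u16string out;
    if (cp < 0x10000) {
        // Basic Multilingual Plane: a single code unit.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Supplementary planes: split the 20-bit offset into a surrogate pair.
        const char32_t offset = cp - 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (offset >> 10)));   // lead
        out.push_back(static_cast<char16_t>(0xDC00 + (offset & 0x3FF))); // trail
    }
    return out;
}
```

For U+1F674 the offset is 0xF674, whose upper 10 bits are 0x3D and lower 10 bits are 0x274, giving 0xD83D and 0xDE74.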
Upvotes: 4