Reputation: 9527

Get a count of codepoints in an ICU UnicodeString?

I have just started using the ICU library in C++.

UnicodeString ucs = UnicodeString::fromUTF8(StringPiece(u8"\U0001F674"));
ucs = ucs.unescape();
size_t len = ucs.length();

However, len = 2. Why? I have added only one character, which takes 4 bytes in UTF-8 (https://unicode-table.com/en/1F674/). Is there a way to get the correct length?

I expect the length to be 1, since there is only 1 code point. If I use

UnicodeString ucs = UnicodeString::fromUTF8(StringPiece(u8"\u06b5"));
ucs = ucs.unescape();
size_t len = ucs.length();

I get the correct len = 1.

Upvotes: 0

Answers (2)

Reputation: 6424

To answer the original question, in order to get the number of code points in a UnicodeString, use UnicodeString::countChar32.

-- Shane (from the ICU team)
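To illustrate what a code point count means for UTF-16 storage, here is a minimal standard-library sketch (not ICU code). The helper name countCodePoints is hypothetical and only mirrors the idea of counting a surrogate pair once; with ICU itself, the call would simply be ucs.countChar32().

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper for illustration only (not part of ICU).
// Counts code points in a UTF-16 string: a lead surrogate followed by a
// trail surrogate is counted once; every other code unit counts as one.
std::size_t countCodePoints(const std::u16string& s) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t c = s[i];
        const bool isLead = c >= 0xD800 && c <= 0xDBFF;
        if (isLead && i + 1 < s.size() &&
            s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            ++i;  // consume the trail surrogate as part of this code point
        }
        ++count;
    }
    return count;
}
```

For the string from the question, std::u16string(u"\U0001F674").size() is 2 (code units), while countCodePoints(u"\U0001F674") is 1 (code points).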

Upvotes: 2

Reputation: 15172

UnicodeString uses UTF-16, not UTF-8.

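In UTF-16, code point U+1F674 requires two 2-byte code units (a surrogate pair): 0xD83D 0xDE74. Code point U+06B5 requires only one 2-byte code unit: 0x06B5. UnicodeString::length() returns the number of code units, not code points, which is why you get 2 and 1 respectively.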
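The arithmetic behind those code unit values can be sketched as follows. This is a standard-library illustration, not ICU code; the helper name toUtf16 is hypothetical, and it assumes the input is a valid code point (at most 0x10FFFF and not itself a surrogate).

```cpp
#include <string>

// Hypothetical helper for illustration only (not part of ICU).
// Encodes one code point as UTF-16. Assumes cp is a valid scalar value.
std::u16string toUtf16(char32_t cp) {
    std::u16string out;
    if (cp < 0x10000) {
        // Basic Multilingual Plane: a single code unit holds the value.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Supplementary planes: split the 20-bit offset into two halves.
        const char32_t v = cp - 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));    // lead
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));  // trail
    }
    return out;
}
```

For U+1F674, the offset is 0xF674, whose upper 10 bits are 0x3D and lower 10 bits are 0x274, giving 0xD83D and 0xDE74.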

Upvotes: 3