Reputation: 5900
I have a std::string containing UTF-8 characters (some Latin, some non-Latin) on Linux and Mac.
As we know, the size of a UTF-8 character is not fixed, and some characters take more than 1 byte (unlike regular Latin characters).
The question is: how can I get the character at offset i?
It makes sense to use an int32 data type to store the character, but how do I get that character?
For example:
std::string str = read_utf8_text();
int c_can_be_more_than_one_byte = str[i]; // <-- obviously this code is wrong
It is important to point out that I do not know the size of the character at offset i.
Upvotes: 0
Views: 384
Reputation: 8636
It's very simple.
First, you have to understand that you can't calculate the position without iterating over the string (that's obvious for variable-length characters).
Second, you need to remember that in UTF-8 a character can be 1-4 bytes long, and when it occupies more than one byte, every trailing (continuation) byte has 10 as its two most significant bits. So you just count bytes, ignoring any byte for which (byte_val & 0xC0) == 0x80.
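As a minimal sketch of that bit test, assuming the input is valid UTF-8: the helper names is_continuation_byte and count_code_points below are hypothetical and chosen only for illustration. Counting every byte that is not a continuation byte gives the number of characters (code points) in the string.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true if the byte has the bit pattern 10xxxxxx,
// i.e. it is a trailing (continuation) byte of a multi-byte character.
inline bool is_continuation_byte(unsigned char b) {
    return (b & 0xC0) == 0x80;
}

// Hypothetical helper: counts characters by counting every byte that is
// NOT a continuation byte. Assumes the string holds valid UTF-8.
inline std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char b : s) {
        if (!is_continuation_byte(b)) {
            ++n;
        }
    }
    return n;
}
```

Taking each byte as unsigned char avoids surprises on platforms where plain char is signed.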
Unfortunately, I don't have a compiler at hand right now, so please forgive any mistakes in the code:
int desired_index = 19; // 0-based index of the character we want
int index = 0;
const char* p = my_str.c_str(); // c_str() returns const char*, not char*
while ( *p && index < desired_index ){
    if ( (*p & 0xC0) != 0x80 ) // if it is the first byte of a character
        index++;
    p++;
}
// p may now point at the trailing bytes (2nd-4th) of the previous character; skip them
while ( (*p & 0xC0) == 0x80 )
    p++;
if ( *p ){
    // here p points to the first byte of your desired character
} else {
    // we reached the end of the string while searching
}
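The loop above only locates the first byte of the character. To actually get the value into a 32-bit type, as the question asks, the bytes still have to be decoded. Here is a hedged sketch of one way to do that: code_point_at is a hypothetical helper, not part of any library, and it only checks the basic byte structure (it does not reject overlong encodings or surrogate values).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decodes the character at 0-based character index
// `char_index` into `out`. Returns false if the index is past the end of
// the string or the byte sequence is truncated or malformed.
inline bool code_point_at(const std::string& s, std::size_t char_index,
                          char32_t& out) {
    // Step 1: walk to the lead byte of the requested character.
    std::size_t i = 0;
    std::size_t seen = 0;
    for (; i < s.size(); ++i) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        if ((b & 0xC0) != 0x80) {       // not a continuation byte
            if (seen == char_index) break;
            ++seen;
        }
    }
    if (i >= s.size()) return false;    // fewer characters than requested

    // Step 2: the lead byte tells us the sequence length and the top bits.
    unsigned char lead = static_cast<unsigned char>(s[i]);
    std::size_t len = 0;
    char32_t cp = 0;
    if (lead < 0x80)                { len = 1; cp = lead; }
    else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
    else return false;                  // not a valid lead byte

    // Step 3: each continuation byte contributes its low 6 bits.
    if (i + len > s.size()) return false;
    for (std::size_t k = 1; k < len; ++k) {
        unsigned char c = static_cast<unsigned char>(s[i + k]);
        if ((c & 0xC0) != 0x80) return false;
        cp = (cp << 6) | (c & 0x3F);
    }
    out = cp;
    return true;
}
```

Each call scans from the start of the string, so reading many characters this way is quadratic; for sequential access it is cheaper to keep the byte offset and advance it by the decoded length.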
Upvotes: 2