TCS

Reputation: 5900

Get "char" of a multi-byte character in linux/mac

I have a std::string containing UTF-8 characters (some Latin, some non-Latin) on Linux and Mac.

As we know, a UTF-8 character does not have a fixed size: unlike the regular Latin characters, some characters take more than 1 byte.

The question is: how can I get the character at offset i?

It makes sense to use an int32 data type to store the character, but how do I get that character?

For example:

std::string str = read_utf8_text();
int c_can_be_more_than_one_byte = str[i]; // <-- obviously this code is wrong

It is important to point out that I do not know the size of the character at offset i.

Upvotes: 0

Views: 384

Answers (1)

grapes

Reputation: 8636

It's very simple.

First, you have to understand that you can't calculate the position without iterating over the string (that's unavoidable for variable-length characters).

Second, you need to remember that a UTF-8 character occupies 1 to 4 bytes, and when it occupies more than one byte, every trailing (continuation) byte has its two most significant bits set to 10. So you just count bytes, skipping any byte for which (byte_val & 0xC0) == 0x80.
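To make the continuation-byte rule concrete, here is a minimal sketch that counts the characters (code points) in a string by counting every byte that is not a continuation byte. The names is_continuation_byte and count_code_points are made up for illustration, and the sketch assumes the input is well-formed UTF-8 (it does no validation).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true if `byte` has the bit pattern 10xxxxxx,
// i.e. it is a trailing (continuation) byte of a multi-byte character.
inline bool is_continuation_byte(unsigned char byte) {
    return (byte & 0xC0) == 0x80;
}

// Hypothetical helper: counts code points by counting the bytes that
// start a character. Assumes `s` is well-formed UTF-8; nothing is validated.
inline std::size_t count_code_points(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if (!is_continuation_byte(byte)) {
            ++count;
        }
    }
    return count;
}
```

For example, a string holding "héllo" in UTF-8 is 6 bytes long, but count_code_points returns 5, because the second byte of "é" is a continuation byte and is skipped.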

Unfortunately, I don't have a compiler at my disposal right now, so please forgive any mistakes in the code:

int desired_index = 19;          // 0-based index of the character we want
int index = 0;                   // number of characters passed so far
const char* p = my_str.c_str();  // c_str() returns const char*, not char*
while ( *p && index < desired_index ){
  if ( (*p & 0xC0) != 0x80 ) // if it is the first byte of a character
    index++;
  p++;
}

// now p may point to the trailing (2-4) bytes of the previous character, skip them
while ( (*p & 0xC0) == 0x80 )
  p++;

if ( *p ){
  // here p points to the first byte of your desired char
} else {
  // we reached the end of the string while searching
}
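The loop above only locates the first byte of the character. Since the question asks for the character as a 32-bit integer, here is a rough sketch of the remaining step: decoding the bytes into a code point. The function name code_point_at is hypothetical, and note that it steps through the string using the length encoded in each lead byte rather than by testing for continuation bytes. It returns -1 for an out-of-range index or a broken sequence, but it does not reject overlong encodings or surrogates, so treat it as an illustration rather than a full validator.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: returns the code point at 0-based character index
// `char_index`, or -1 if the index is past the end or the bytes are malformed.
inline std::int32_t code_point_at(const std::string& s, std::size_t char_index) {
    std::size_t i = 0;     // byte offset of the current character
    std::size_t seen = 0;  // number of characters skipped so far
    while (i < s.size()) {
        unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;
        std::int32_t cp = 0;
        // The lead byte tells us the sequence length and holds the top payload bits.
        if (lead < 0x80)                { len = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
        else return -1;  // stray continuation byte or invalid lead byte
        if (i + len > s.size()) return -1;  // sequence is cut off
        if (seen == char_index) {
            // Each continuation byte contributes its low 6 bits.
            for (std::size_t k = 1; k < len; ++k) {
                unsigned char cont = static_cast<unsigned char>(s[i + k]);
                if ((cont & 0xC0) != 0x80) return -1;
                cp = (cp << 6) | (cont & 0x3F);
            }
            return cp;
        }
        i += len;
        ++seen;
    }
    return -1;  // fewer than char_index + 1 characters in the string
}
```

With this sketch, the two bytes 0xC3 0xA9 decode to 0xE9 (é) and the three bytes 0xE2 0x82 0xAC decode to 0x20AC (€). The cost is still linear in the index, as explained above; if you need many random accesses, converting the whole string to a sequence of 32-bit code points once may be the simpler choice.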

Upvotes: 2
