Reputation: 11567

Get length of multibyte UTF-8 sequence

I am parsing some UTF-8 text but am only interested in characters in the ASCII range, i.e., I can just skip multibyte sequences.

I can easily detect the beginning of a sequence because the high bit is set, so the value is < 0 when char is signed. But how can I tell how many bytes are in the sequence so I can skip over it?

I do not need to perform any validation, i.e., I can assume the input is valid UTF-8.

Upvotes: 3

Answers (2)

wholerabbit

Reputation: 11567

Although Deduplicator's answer is more appropriate to the specific purpose of skipping over multibyte sequences, if there is a need to get the length of each such character, pass the first byte to this function:

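/* Returns the total byte length (2 to 4) of a multibyte UTF-8 sequence,
   given its first byte. Only meaningful for lead bytes (0xC0 and above). */
int getUTF8SequenceLength (unsigned char firstPoint) {
    firstPoint >>= 4;                /* keep the high nibble: 1100..1111 */
    firstPoint &= 7;                 /* drop the top bit, leaving 4..7 */
    if (firstPoint == 4) return 2;   /* 1100xxxx: two bytes */
    return firstPoint - 3;           /* 5 -> 2, 6 -> 3, 7 -> 4 */
}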

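This returns the total length of the sequence, including the first byte. It is only meaningful for the first byte of a multibyte sequence, not for ASCII or continuation bytes. I'm using an unsigned char value as the firstPoint parameter here for clarity; in practice the function works the same way if the parameter is a signed char, because the three bits kept by the mask do not depend on how the right shift fills the upper bits.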
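
As a rough sketch of how the function might be used for the skipping described in the question: the helper below, `copyAscii`, is a hypothetical name of my own, and it assumes the input is valid, NUL-terminated UTF-8 (with invalid input, such as a stray continuation byte, the pointer arithmetic is not safe).

```c
#include <stddef.h>

/* Same logic as the function above, repeated so this sketch compiles alone. */
int getUTF8SequenceLength(unsigned char firstPoint) {
    firstPoint >>= 4;
    firstPoint &= 7;
    if (firstPoint == 4) return 2;
    return firstPoint - 3;
}

/* Hypothetical helper: copies only the ASCII characters of `in` to `out`
   and returns how many were copied. Assumes `in` is valid UTF-8 and that
   `out` has room for strlen(in) + 1 bytes. */
size_t copyAscii(const char *in, char *out) {
    size_t n = 0;
    while (*in != '\0') {
        unsigned char c = (unsigned char)*in;
        if (c < 0x80) {
            out[n++] = *in++;                 /* ASCII: keep it */
        } else {
            in += getUTF8SequenceLength(c);   /* skip the whole sequence */
        }
    }
    out[n] = '\0';
    return n;
}
```

For example, calling it on the bytes of "aéb€c" followed by a 4-byte emoji and "d" should leave "abcd" in `out`.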

To explain:

UTF-8 uses the bits just below the top bit of the first byte to indicate the remaining length. Counting from 1 at the right, these are bits 7, 6, and 5. If all three are set (1111 0xxx), the sequence is 3 additional bytes. If bits 7 and 6 are set and bit 5 is clear (1110 xxxx), the sequence is 2 additional bytes. If bit 7 is set and bit 6 is clear (110x xxxx), the sequence is 1 additional byte; in that case bit 5 is already part of the character data and may be 0 or 1. Hence, we want to examine these three bits (the value here is just an example):
```
 11110111
  ^^^
```
The value is shifted right by 4 and then ANDed with 7. This leaves the 1st, 2nd, and 3rd bits from the right as the only ones that can be set. The values of these bits are 1, 2, and 4 respectively.
```
00000111
     ^^^ 
```
If the value is now 4, we know only the first bit from the left (of the three we are considering) is set and can return 2.
Otherwise the value is 5, 6, or 7, and subtracting 3 gives the total length: 5 is also a 2-byte sequence (the lowest of the three bits is character data there), 6 means the first two from the left are set, so the sequence is 3 bytes in total, and 7 means all three bits are set, so the sequence is 4 bytes in total.

This covers the range of valid Unicode characters expressed in UTF-8.
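
If the shift arithmetic feels too terse, the same mapping can be written by testing the prefix bits directly. This is only an alternative sketch, not part of the function above; `lengthByPrefix` and `lengthsAgree` are hypothetical names, and the second exists purely to cross-check the two versions over every lead byte from 0xC0 to 0xF7.

```c
/* The shift-based function from above, repeated so this sketch compiles alone. */
int getUTF8SequenceLength(unsigned char firstPoint) {
    firstPoint >>= 4;
    firstPoint &= 7;
    if (firstPoint == 4) return 2;
    return firstPoint - 3;
}

/* Hypothetical alternative: classify the first byte by its prefix bits. */
int lengthByPrefix(unsigned char first) {
    if ((first & 0x80) == 0x00) return 1;   /* 0xxxxxxx: ASCII */
    if ((first & 0xE0) == 0xC0) return 2;   /* 110xxxxx */
    if ((first & 0xF0) == 0xE0) return 3;   /* 1110xxxx */
    if ((first & 0xF8) == 0xF0) return 4;   /* 11110xxx */
    return 0;                               /* continuation byte or not a lead byte */
}

/* Returns 1 if both functions agree for every lead byte pattern 0xC0..0xF7. */
int lengthsAgree(void) {
    unsigned b;
    for (b = 0xC0; b <= 0xF7; b++) {
        if (getUTF8SequenceLength((unsigned char)b) != lengthByPrefix((unsigned char)b)) {
            return 0;
        }
    }
    return 1;
}
```

Unlike the shift version, the prefix version also gives a defined answer for ASCII and continuation bytes, which may be useful if the input turns out not to be as clean as assumed.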

Upvotes: 5

Deduplicator

Reputation: 45694

Just strip out all bytes that are not valid ASCII; don't try to get cute and interpret bytes > 127 at all. This works as long as you don't have any combining sequences with a base character in the ASCII range. For those, you would need to interpret the code points themselves.
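
A minimal sketch of what that could look like, assuming a NUL-terminated, writable buffer; `stripNonAscii` is a hypothetical name, not an existing library function.

```c
#include <stddef.h>

/* Hypothetical helper: removes every byte >= 0x80 from `s` in place and
   returns the new length. No sequence lengths are computed at all. */
size_t stripNonAscii(char *s) {
    size_t n = 0;
    size_t i;
    for (i = 0; s[i] != '\0'; i++) {
        if ((unsigned char)s[i] < 0x80) {
            s[n++] = s[i];   /* keep ASCII bytes, drop everything else */
        }
    }
    s[n] = '\0';
    return n;
}
```

The combining-sequence caveat shows up directly here: the bytes `65 CC 81` ("e" followed by U+0301, a combining acute accent) come out as a plain "e", because the ASCII base character is kept while the accent is dropped.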

Upvotes: 5
