Josh Weinstein

Reputation: 2968

Determine the width in bytes of a utf-8 character

I am trying to determine the width, in bytes, of a UTF-8 character based on its binary representation, and to use that to count the number of characters in a UTF-8 string. Below is my code.

#include <stdlib.h>
#include <stdio.h>

static const char* test1 = "发f";
static const char* test2 = "ด้ดีด้ดี";

unsigned utf8_char_size(unsigned char val) {
    if (val < 128) {
        return 1;
    } else if (val < 224) {
        return 2;
    } else if (val < 240) {
        return 3;
    } else {
        return 4;
    }
}

unsigned utf8_count_chars(const unsigned char* data)
{
  unsigned total = 0;
  while(*data != 0) {
    unsigned char_width = utf8_char_size(*data);
    total++;
    data += char_width;
  }
  return total;
}

int main(void) {
  fprintf(stdout, "The count is %u\n", utf8_count_chars((unsigned char*)test1));
  fprintf(stdout, "The count is %u\n", utf8_count_chars((unsigned char*)test2));
  return 0;
}

The problem is that I get "The count is 2" for the first test, which makes sense, but for the second one, test2, which has 4 Thai letters, it prints 8, which is not correct.

I would like to know what my code is doing wrong and, furthermore, given an array of unsigned char in C, how one iterates through the bytes as UTF-8 characters.

Upvotes: 2

Views: 720

Answers (1)

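The code measures neither characters nor glyphs but code points. A single user-perceived character can be composed of multiple Unicode code points. In this case the Thai text has 8 code points: each of the 4 consonants is followed by a combining tone mark or vowel sign, so the count of 8 is correct for what the code actually counts.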

Unicode strings are easier to inspect in Python than in C, so here's a small Python 3.6 demonstration using the built-in Unicode database:

>>> import unicodedata
>>> for i in 'ด้ดีด้ดี':
...     print(f'{ord(i):04X} {unicodedata.name(i)}')
... 
0E14 THAI CHARACTER DO DEK
0E49 THAI CHARACTER MAI THO
0E14 THAI CHARACTER DO DEK
0E35 THAI CHARACTER SARA II
0E14 THAI CHARACTER DO DEK
0E49 THAI CHARACTER MAI THO
0E14 THAI CHARACTER DO DEK
0E35 THAI CHARACTER SARA II

Upvotes: 4
