Michal

Reputation: 2228

Character length in bytes

Given the first byte of a multi-byte character and the canonical name of its charset, how can I determine the character's length in bytes?

Ideally the solution would use the ICU library.

Upvotes: 0

Views: 1264

Answers (2)

Michal

Reputation: 2228

Use ucnv_getNextUChar from the ICU library. The following code splits a byte stream into characters and prints each character together with its size in bytes:

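#include <cstring>
#include <iostream>
#include <unicode/errorcode.h>
#include <unicode/ucnv.h>

int main() {
    const char* text = "Samotność - cóż po ludziach, czym śpiewak dla ludzi";
    icu::ErrorCode err;
    UConverter* conv = ucnv_open("UTF-8", err);
    const char* curr = text;
    const char* end = text + std::strlen(text);  // fixed limit; it must not move with curr
    while (curr < end && err.isSuccess()) {
        const char* prev = curr;
        ucnv_getNextUChar(conv, &curr, end, err);  // advances curr past one character
        // Print every byte of the character (not just the first), then its byte length.
        std::cout.write(prev, curr - prev) << "  " << (curr - prev) << '\n';
    }
    ucnv_close(conv);
    return err.isFailure() ? 1 : 0;
}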

Upvotes: 2

kyriosli

Reputation: 333

Most character encodings are designed so that the byte length of a character can be determined from the start of its encoding, usually the first byte. For the common ones:

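  • In UTF-16, each code unit is two bytes. Characters in the Basic Multilingual Plane take one code unit; all others take a surrogate pair (four bytes), which you can recognize because the first code unit is in the range 0xD800 to 0xDBFF.
  • In UTF-8, there are four cases:
    1. code points below 0x80 are encoded as 0xxx xxxx (one byte)
    2. code points from 0x80 to 0x7FF are encoded as 110x xxxx 10xx xxxx (two bytes)
    3. code points from 0x800 to 0xFFFF are encoded as 1110 xxxx 10xx xxxx 10xx xxxx (three bytes)
    4. code points from 0x10000 to 0x10FFFF are encoded as 1111 0xxx 10xx xxxx 10xx xxxx 10xx xxxx (four bytes)
  • In GBK, a first byte greater than 0x7f means the character has a second byte; otherwise it is a single byte.
  • In ISO-8859-1 (Latin-1) and other single-byte encodings, every character is exactly one byte.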
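
The UTF-8 and GBK rules above can be sketched in plain C++ without ICU. This is only an illustration: the function names are made up for this example, a return value of 0 is my own convention for "this byte cannot start a character", and the GBK range assumes lead bytes 0x81 to 0xFE, which may differ slightly between GBK variants.

```cpp
#include <cstddef>

// Hypothetical helper: total length in bytes of a UTF-8 sequence, judged
// from its first byte alone. Returns 0 for a continuation byte (10xxxxxx)
// or a byte that can never start a valid sequence (0xC0, 0xC1, 0xF5 and up).
constexpr std::size_t utf8_length_from_lead_byte(unsigned char b) {
    return b < 0x80               ? 1   // 0xxxxxxx
         : (b >= 0xC2 && b <= 0xDF) ? 2 // 110xxxxx
         : (b >= 0xE0 && b <= 0xEF) ? 3 // 1110xxxx
         : (b >= 0xF0 && b <= 0xF4) ? 4 // 11110xxx
         : 0;
}

// Hypothetical helper for GBK, assuming lead bytes are 0x81..0xFE.
// Bytes below 0x80 are single-byte ASCII; anything else is reported as 0.
constexpr std::size_t gbk_length_from_lead_byte(unsigned char b) {
    return b < 0x80               ? 1
         : (b >= 0x81 && b <= 0xFE) ? 2
         : 0;
}
```

For example, 'ś' is encoded in UTF-8 as 0xC5 0x9B, so passing 0xC5 yields 2, while passing 0x9B on its own yields 0 because it is a continuation byte. Note that this only inspects the first byte; it does not check that the following bytes are actually valid, which is something ucnv_getNextUChar does for you.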

Upvotes: 1
