Character length in bytes

Question

Given the first byte of a multi-byte character and the canonical name of the charset, how can I determine the length of the character in bytes?

Ideally the solution would use the ICU library.

Michal · Accepted Answer

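Use ucnv_getNextUChar from the ICU library. It decodes one character and advances the source pointer past it, so the difference between the pointer's new and old positions is the character's length in bytes. The following code splits a byte stream into characters and prints the size of each one: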

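#include <unicode/ucnv.h>
#include <unicode/errorcode.h>
#include <cstring>
#include <iostream>
#include <string>

int main() {
    const char* utf8_strings[] = {"Samotność - cóż po ludziach, czym śpiewak dla ludzi"};
    icu::ErrorCode err;
    UConverter* conv = ucnv_open("UTF-8", err);
    const char* curr = utf8_strings[0];
    const char* end = curr + std::strlen(curr);  // fixed end of the input buffer
    while (curr < end && err.isSuccess()) {
        const char* prev = curr;
        // Decodes one character and advances curr past all of its bytes.
        ucnv_getNextUChar(conv, &curr, end, err);
        // Print every byte of the character, then its length in bytes.
        std::cout << std::string(prev, curr) << "  " << curr - prev << std::endl;
    }
    ucnv_close(conv);
}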
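
For UTF-8 specifically, the length can also be read off the first byte alone, without a converter. The sketch below is plain C++ and does not use ICU; utf8_length_from_lead is a hypothetical helper name, not a library function. It does not generalize to every charset: in GB18030, for example, the second byte decides whether a character is two or four bytes long, which is why the converter-based loop above is the general approach.

```cpp
#include <cstddef>

// Hypothetical helper (not part of ICU): length in bytes of a UTF-8
// sequence, judged from its lead byte alone. Returns 0 for a byte that
// cannot start a sequence: continuation bytes 0x80-0xBF, the always
// invalid 0xC0 and 0xC1, and 0xF5-0xFF.
constexpr std::size_t utf8_length_from_lead(unsigned char lead) {
    return lead < 0x80                   ? 1   // ASCII
         : (lead >= 0xC2 && lead <= 0xDF) ? 2
         : (lead >= 0xE0 && lead <= 0xEF) ? 3
         : (lead >= 0xF0 && lead <= 0xF4) ? 4
         : 0;                                  // not a valid lead byte
}
```

Calling utf8_length_from_lead(0xC3) on the first byte of "ó" (encoded as C3 B3) gives 2. The helper only classifies the lead byte; it does not check that the following bytes are valid continuation bytes.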
