Reputation: 2228
Given the first byte of a multi-byte character and the charset's canonical name, how can I determine the byte length of that character?
Ideally the solution would use the ICU library.
Upvotes: 0
Views: 1264
Reputation: 2228
Use ucnv_getNextUChar from the ICU library. The following code splits a byte stream into characters and prints the size of each character in bytes:
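#include <cstring>
#include <iostream>
#include <unicode/errorcode.h>
#include <unicode/ucnv.h>

const char* utf8_strings[] = {"Samotność - cóż po ludziach, czym śpiewak dla ludzi"};
icu::ErrorCode err;
UConverter* conv = ucnv_open("UTF-8", err);
const char* curr = utf8_strings[0];
const char* end = curr + strlen(curr);  // fixed source limit; it must not move with curr
while (curr < end && err.isSuccess()) {
    const char* prev = curr;
    ucnv_getNextUChar(conv, &curr, end, err);  // advances curr past one character
    // Print all bytes of the character, then its length in bytes.
    std::cout.write(prev, curr - prev) << " " << (curr - prev) << std::endl;
}
ucnv_close(conv);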
Upvotes: 2
Reputation: 333
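Most multi-byte character sets are designed so that the byte length of a character can be determined from its first byte. In UTF-8, for example, the leading bits of the first byte tell you how many bytes follow:
0xxx xxxx (1 byte)
110x xxxx 10xx xxxx (2 bytes)
1110 xxxx 10xx xxxx 10xx xxxx (3 bytes)
1111 0xxx 10xx xxxx 10xx xxxx 10xx xxxx (4 bytes)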
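The bit patterns above can be checked with a few masks. Here is a minimal sketch for UTF-8 only; the function name utf8_length_from_lead is made up for illustration and is not part of ICU or any other library. It also handles the 4-byte form (1111 0xxx) that UTF-8 defines. It only inspects the pattern of the first byte, so it does not reject bytes such as 0xC0 or 0xF5 that match a pattern but never occur in well-formed UTF-8.

```cpp
#include <cstddef>

// Hypothetical helper (not a library function): returns the total length in
// bytes of a UTF-8 sequence, judged from its first byte alone. Returns 0 for
// a byte that cannot start a sequence (a continuation byte 10xx xxxx, or
// 1111 1xxx).
constexpr std::size_t utf8_length_from_lead(unsigned char lead) {
    return (lead & 0x80) == 0x00 ? 1   // 0xxx xxxx
         : (lead & 0xE0) == 0xC0 ? 2   // 110x xxxx
         : (lead & 0xF0) == 0xE0 ? 3   // 1110 xxxx
         : (lead & 0xF8) == 0xF0 ? 4   // 1111 0xxx
         : 0;                          // not a lead byte
}
```

For example, "ś" is encoded in UTF-8 as 0xC5 0x9B, so utf8_length_from_lead(0xC5) returns 2, while utf8_length_from_lead(0x9B) returns 0 because 0x9B is a continuation byte. For other charsets the patterns differ, which is why going through a converter as in the ICU answer is the more general route.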
Upvotes: 1