If we want to represent more characters than ASCII allows, we can use Unicode, which uses more bits than ASCII to represent some characters. One encoding of Unicode, UTF-8, uses “variable-width encoding”: a character can be represented by one, two, three, or four bytes.
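You can observe those variable widths directly. A minimal sketch, assuming a C11 compiler (u8 literals) and arbitrary sample characters: strlen() reports the encoded byte count of each one-character literal.

#include <stdio.h>
#include <string.h>

int main(void)
{
    // Each literal is ONE Unicode character encoded as UTF-8;
    // strlen() counts its bytes, not its characters.
    printf("%zu\n", strlen(u8"A"));          // 1 byte  (U+0041)
    printf("%zu\n", strlen(u8"\u00E9"));     // 2 bytes (U+00E9, é)
    printf("%zu\n", strlen(u8"\u20AC"));     // 3 bytes (U+20AC, €)
    printf("%zu\n", strlen(u8"\U0001F600")); // 4 bytes (U+1F600, 😀)
    return 0;
}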
As you say, a Unicode codepoint (what you call a character) can be represented in UTF-8 using 1 to 4 code units (8-bit bytes). The bit pattern of the first code unit tells you how many code units are used:
- Codepoints U+0000..U+007F use 1 code unit, where the sole code unit has its high bit set to 0.
- Codepoints U+0080..U+07FF use 2 code units, where the 1st code unit has its high 3 bits set to 110.
- Codepoints U+0800..U+FFFF use 3 code units, where the 1st code unit has its high 4 bits set to 1110.
- Codepoints U+10000..U+10FFFF use 4 code units, where the 1st code unit has its high 5 bits set to 11110.
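As a concrete worked example, the Euro sign U+20AC falls in the 3-byte range, so its 16 significant bits are split across the payload bits (shown as x's) of the 3-unit template:

U+20AC   = 0010 0000 1010 1100           (16 significant bits)
template : 1110xxxx 10xxxxxx 10xxxxxx    (3 code units)
filled   : 11100010 10000010 10101100    (bits placed high to low)
         = 0xE2     0x82     0xAC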
Given the 1st code unit of a UTF-8 encoded codepoint, you can mask its bits with a bitwise AND operator to determine which pattern is used, e.g.:
#include <stdint.h>
#include <stdio.h>

int32_t readUTF8Char(FILE *f)
{
    uint8_t b;
    if (fread(&b, 1, 1, f) != 1) {
        // read error or EOF
        return -1;
    }
    if ((b & 0x80) == 0) {
        // 1 byte, use b as-is ...
        return b;
    }
    int32_t codePoint;
    int num = 0;
    if ((b & 0xE0) == 0xC0) {
        // 2 bytes, read 1 more byte ...
        codePoint = b & 0x1F;
        num = 1;
    }
    else if ((b & 0xF0) == 0xE0) {
        // 3 bytes, read 2 more bytes ...
        codePoint = b & 0x0F;
        num = 2;
    }
    else if ((b & 0xF8) == 0xF0) {
        // 4 bytes, read 3 more bytes ...
        codePoint = b & 0x07;
        num = 3;
    }
    else {
        // malformed lead byte ...
        return -1;
    }
    for (int i = 0; i < num; ++i) {
        if (fread(&b, 1, 1, f) != 1) {
            // read error or EOF
            return -1;
        }
        if ((b & 0xC0) != 0x80) {
            // malformed continuation byte ...
            return -1;
        }
        // append the low 6 payload bits of the continuation byte
        codePoint = (codePoint << 6) | (b & 0x3F);
    }
    return codePoint;
}
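A caller might look like this (a minimal sketch; the filename input.txt is just a placeholder). Note that the function returns -1 at end-of-file as well as on errors, so the loop stops on EOF and on malformed input alike:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

// readUTF8Char() as defined above

int main(void)
{
    FILE *f = fopen("input.txt", "rb"); // placeholder filename
    if (f == NULL) {
        perror("fopen");
        return 1;
    }
    int32_t cp;
    // -1 signals EOF, a read error, or a malformed sequence
    while ((cp = readUTF8Char(f)) != -1) {
        printf("U+%04" PRIX32 "\n", (uint32_t)cp);
    }
    fclose(f);
    return 0;
}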