When reading, as via fread, a text file encoded as UTF-8, how can you determine how many bytes a character will take?

Question

If we want to represent more characters than ASCII allows, we can use Unicode, which uses more bits than ASCII to represent some characters. One encoding of Unicode, UTF-8, uses “variable-width encoding”: each character is represented by one, two, three, or four bytes.

Remy Lebeau · Accepted Answer

As you say, a Unicode codepoint (what you call a character) is represented in UTF-8 using 1..4 code units (8-bit bytes). The bit pattern of the first code unit tells you how many code units are used:

Codepoints U+0000..U+007F use 1 code unit, where the sole code unit has its high bit set to 0.

Codepoints U+0080..U+07FF use 2 code units, where the 1st code unit has its high 3 bits set to 110.

Codepoints U+0800..U+FFFF use 3 code units, where the 1st code unit has its high 4 bits set to 1110.

Codepoints U+10000..U+10FFFF use 4 code units, where the 1st code unit has its high 5 bits set to 11110.
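
To make the ranges above concrete, here is a minimal sketch that maps a codepoint to the number of code units it needs. The name utf8EncodedLength is hypothetical, not a library function, and returning 0 for UTF-16 surrogates and out-of-range values is an assumption of this sketch:

```c
#include <stdint.h>

// Hypothetical helper: returns how many UTF-8 code units are needed to
// encode codePoint, or 0 if it cannot be encoded (a UTF-16 surrogate or
// a value above U+10FFFF).
int utf8EncodedLength(uint32_t codePoint)
{
    if (codePoint <= 0x7F) {
        return 1; // U+0000..U+007F
    }
    if (codePoint <= 0x7FF) {
        return 2; // U+0080..U+07FF
    }
    if (codePoint >= 0xD800 && codePoint <= 0xDFFF) {
        return 0; // surrogates are not valid in UTF-8
    }
    if (codePoint <= 0xFFFF) {
        return 3; // U+0800..U+FFFF
    }
    if (codePoint <= 0x10FFFF) {
        return 4; // U+10000..U+10FFFF
    }
    return 0; // beyond the Unicode range
}
```

For example, U+20AC (the euro sign) falls in U+0800..U+FFFF, so it takes 3 code units (E2 82 AC).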

Given the 1st code unit of a UTF-8 encoded codepoint, you can mask its bits with a bitwise AND operator to determine which pattern is used, e.g.:

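#include <stdint.h>
#include <stdio.h>

// Reads one UTF-8 encoded codepoint from f.
// Returns the codepoint, or -1 on read error, end of file, or malformed input.
int32_t readUTF8Char(FILE *f)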
{
    uint8_t b;

    if (fread(&b, 1, 1, f) != 1) {
        // read error
        return -1;
    }

    if ((b & 0x80) == 0) {
        // 1 byte, use b as-is ...
        return b;
    }

    int32_t codePoint;
    int num = 0;

    if ((b & 0xE0) == 0xC0) {
        // 2 bytes, read 1 more byte ...
        codePoint = b & 0x1F;
        num = 1;
    }
    else if ((b & 0xF0) == 0xE0) {
        // 3 bytes, read 2 more bytes ...
        codePoint = b & 0x0F;
        num = 2;
    }
    else if ((b & 0xF8) == 0xF0) {
        // 4 bytes, read 3 more bytes ...
        codePoint = b & 0x07;
        num = 3;
    }
    else {
        // malformed...
        return -1;
    }

    for(int i = 0; i < num; ++i) {
        if (fread(&b, 1, 1, f) != 1) {
            // read error
            return -1;
        }
        if ((b & 0xC0) != 0x80) {
            // malformed
            return -1;
        }
        codePoint = (codePoint << 6) | (b & 0x3F);
    }
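    // reject overlong encodings, UTF-16 surrogates, and values above U+10FFFF
    static const int32_t minCodePoint[] = { 0, 0x80, 0x800, 0x10000 };
    if (codePoint < minCodePoint[num] ||
        (codePoint >= 0xD800 && codePoint <= 0xDFFF) ||
        codePoint > 0x10FFFF) {
        return -1;
    }
    return codePoint;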
}
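
If you only need the byte count and not the decoded value, the same masks can be used on their own. This is a minimal sketch under that assumption; the name utf8SequenceLength is hypothetical, and it only inspects the first code unit, so it does not validate the continuation bytes that follow:

```c
#include <stdint.h>

// Hypothetical helper: given the 1st code unit of a UTF-8 sequence,
// returns the total number of code units in that sequence (1..4),
// or 0 if the byte cannot start a sequence.
int utf8SequenceLength(uint8_t first)
{
    if ((first & 0x80) == 0x00) {
        return 1; // 0xxxxxxx
    }
    if ((first & 0xE0) == 0xC0) {
        return 2; // 110xxxxx
    }
    if ((first & 0xF0) == 0xE0) {
        return 3; // 1110xxxx
    }
    if ((first & 0xF8) == 0xF0) {
        return 4; // 11110xxx
    }
    return 0; // continuation byte (10xxxxxx) or invalid lead byte
}
```

With this, a reader could fread one byte, call utf8SequenceLength on it, and then fread the remaining length - 1 bytes in a single call.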
