Delan Azabani

Reputation: 81412

preventing overlong forms when parsing UTF-8

I have been working on another UTF-8 parser as a personal exercise. While my implementation works quite well and rejects most malformed sequences (replacing them with U+FFFD), I can't seem to figure out how to reject overlong forms. Could anyone tell me how to do so?

Pseudocode:

let w = 0, // the number of continuation bytes pending
    c = 0, // the currently being constructed codepoint
    b,     // the current byte from the source stream
    valid(c) = (
        (c < 0x110000) &&
        ((c & 0xFFFFF800) != 0xD800) &&
        ((c < 0xFDD0) || (c > 0xFDEF)) &&
        ((c & 0xFFFE) != 0xFFFE))
for each b:
    if b < 0x80:
        if w > 0: // premature ending to multi-byte sequence
            append U+FFFD to output string
            w = 0
        append U+b to output string
    else if b < 0xc0:
        if w == 0: // unwanted continuation byte
            append U+FFFD to output string
        else:
            c |= (b & 0x3f) << (--w * 6)
            if w == 0: // done
                if valid(c):
                    append U+c to output string
    else if b < 0xfe:
        if w > 0: // premature ending to multi-byte sequence
            append U+FFFD to output string
        w = (b < 0xe0) ? 1 :
            (b < 0xf0) ? 2 :
            (b < 0xf8) ? 3 :
            (b < 0xfc) ? 4 : 5;
        c = (b & ((1 << (6 - w)) - 1)) << (w * 6); // ugly monstrosity
    else:
        append U+FFFD to output string
if w > 0: // end of stream and we're still waiting for continuation bytes
    append U+FFFD to output string
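For reference, here is a direct Python transliteration of the pseudocode above (a sketch of my own; the function name and structure are mine, not part of the original). It makes one small deviation: when a completed sequence decodes to an invalid codepoint, it emits U+FFFD rather than silently dropping it. Note that it still accepts overlong forms, which is exactly the gap being asked about:

```python
def decode_utf8(data: bytes) -> str:
    """Transliteration of the pseudocode above; no overlong check yet."""
    def valid(c):
        return (c < 0x110000 and
                (c & 0xFFFFF800) != 0xD800 and   # not a surrogate
                not (0xFDD0 <= c <= 0xFDEF) and  # not a U+FDD0..U+FDEF noncharacter
                (c & 0xFFFE) != 0xFFFE)          # not U+xxFFFE / U+xxFFFF

    out = []
    w = 0  # number of continuation bytes still pending
    c = 0  # codepoint currently being constructed
    for b in data:
        if b < 0x80:                     # ASCII byte
            if w > 0:                    # premature end of a multi-byte sequence
                out.append('\ufffd')
                w = 0
            out.append(chr(b))
        elif b < 0xC0:                   # continuation byte
            if w == 0:                   # unwanted continuation byte
                out.append('\ufffd')
            else:
                w -= 1
                c |= (b & 0x3F) << (w * 6)
                if w == 0:               # sequence complete
                    out.append(chr(c) if valid(c) else '\ufffd')
        elif b < 0xFE:                   # lead byte of a 2..6-byte sequence
            if w > 0:                    # premature end of a multi-byte sequence
                out.append('\ufffd')
            w = 1 if b < 0xE0 else 2 if b < 0xF0 else 3 if b < 0xF8 else 4 if b < 0xFC else 5
            c = (b & ((1 << (6 - w)) - 1)) << (w * 6)
        else:                            # 0xFE and 0xFF never appear in UTF-8
            out.append('\ufffd')
    if w > 0:                            # stream ended mid-sequence
        out.append('\ufffd')
    return ''.join(out)
```

As written, `decode_utf8(b'\xc0\xaf')` returns `'/'`: the overlong two-byte encoding of U+002F is accepted, which is the problem the answers below address.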

Upvotes: 2

Views: 754

Answers (3)

Gabe

Reputation: 86768

I had initially thought that if, at any point after decoding a byte, w > 0 && c == 0, you had an overlong form. However, it's more complicated than that, as Jan pointed out. The simplest answer is probably to have a table like xanatos's, only rejecting anything longer than 4 bytes:

if (c < 0x80 && len > 1) ||
   (c < 0x800 && len > 2) ||
   (c < 0x10000 && len > 3) ||
   (len > 4):
    append U+FFFD to output string
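A sketch of that check in Python (my naming: length is the total number of bytes the sequence used, i.e. the saved initial w plus one):

```python
def is_overlong(c: int, length: int) -> bool:
    """True if codepoint c was encoded in more bytes than necessary,
    or in more than 4 bytes (which RFC 3629 forbids outright)."""
    if length > 4:
        return True
    if c < 0x80:
        return length > 1       # ASCII must be 1 byte
    if c < 0x800:
        return length > 2
    if c < 0x10000:
        return length > 3
    return False                # 4 bytes is correct for U+10000..U+10FFFF
```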

Upvotes: 1

xanatos

Reputation: 111890

If you save the number of bytes you'll need (i.e. keep a second copy of the initial value of w), you can compare the UTF-32 value of the codepoint (I think you are calling it c) with the number of bytes that were used to encode it. You know that:

U+0000 - U+007F 1 byte
U+0080 - U+07FF 2 bytes
U+0800 - U+FFFF 3 bytes
U+10000 - U+1FFFFF 4 bytes
U+200000 - U+3FFFFFF 5 bytes
U+4000000 - U+7FFFFFFF 6 bytes

(and I hope I have done the right math on the left column! Hex math isn't my strong point :-) )
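The table above can be sketched as a lookup function in Python (my own sketch, including the historical 5- and 6-byte forms from the table; function name is mine):

```python
def min_utf8_length(c: int) -> int:
    """Smallest number of bytes that can encode codepoint c,
    per the range table above."""
    if c < 0x80:      return 1
    if c < 0x800:     return 2
    if c < 0x10000:   return 3
    if c < 0x200000:  return 4
    if c < 0x4000000: return 5
    return 6

# A sequence is overlong when the saved byte count exceeds this minimum:
#     overlong = used_bytes > min_utf8_length(c)
```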

Just as a side note: I think there are some logic/formatting errors. In the if b < 0x80 branch you have if w > 0, but what happens when w == 0 (so, for example, when you are decoding A)? And shouldn't you reset c when you find an illegal codepoint?

Upvotes: 5

Mat

Reputation: 206831

Once you have the decoded character, you can tell how many bytes it should have had if properly encoded just by looking at the highest bit set.

If the highest set bit's position is <= 7, the UTF-8 encoding requires 1 octet.
If the highest set bit's position is <= 11, the UTF-8 encoding requires 2 octets.
If the highest set bit's position is <= 16, the UTF-8 encoding requires 3 octets.
etc.

If you save the original w and compare it to these values, you'll be able to tell if the encoding was proper or overlong.
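The highest-set-bit approach can be sketched in Python using int.bit_length() (my own sketch; the bit thresholds extend the pattern above to the historical 5- and 6-byte forms):

```python
def required_octets(c: int) -> int:
    """Minimal UTF-8 length for codepoint c, derived from the
    position of its highest set bit, as described above."""
    bits = max(c.bit_length(), 1)  # treat U+0000 as occupying 1 bit
    if bits <= 7:
        return 1   # fits in a single 0xxxxxxx byte
    if bits <= 11:
        return 2   # 110xxxxx 10xxxxxx
    if bits <= 16:
        return 3   # 1110xxxx 10xxxxxx 10xxxxxx
    if bits <= 21:
        return 4
    if bits <= 26:
        return 5
    return 6
```

Comparing this against the saved initial w (plus one) flags any overlong sequence.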

Upvotes: 1
