Reputation: 81412
I have been working on another UTF-8 parser as a personal exercise, and while my implementation works quite well, and it rejects most malformed sequences (replacing them with U+FFFD), I can't seem to figure out how to implement rejection of overlong forms. Could anyone tell me how to do so?
Pseudocode:
let w = 0, // the number of continuation bytes pending
c = 0, // the currently being constructed codepoint
b, // the current byte from the source stream
valid(c) = (
(c < 0x110000) &&
((c & 0xFFFFF800) != 0xD800) &&
((c < 0xFDD0) || (c > 0xFDEF)) &&
((c & 0xFFFE) != 0xFFFE))
for each b:
if b < 0x80:
if w > 0: // premature ending to multi-byte sequence
append U+FFFD to output string
w = 0
append U+b to output string
else if b < 0xc0:
if w == 0: // unwanted continuation byte
append U+FFFD to output string
else:
c |= (b & 0x3f) << (--w * 6)
if w == 0: // done
if valid(c):
append U+c to output string
else if b < 0xfe:
if w > 0: // premature ending to multi-byte sequence
append U+FFFD to output string
w = (b < 0xe0) ? 1 :
(b < 0xf0) ? 2 :
(b < 0xf8) ? 3 :
(b < 0xfc) ? 4 : 5;
c = (b & ((1 << (6 - w)) - 1)) << (w * 6); // ugly monstrosity
else:
append U+FFFD to output string
if w > 0: // end of stream and we're still waiting for continuation bytes
append U+FFFD to output string
Upvotes: 2
Views: 754
Reputation: 86768
I had initially thought that if at any point in time after decoding a byte, w > 0 && c == 0
, you have an overlong form. However, it's more complicated than that as Jan pointed out. The simplest answer is probably to have a table like xanatos has, only rejecting anything longer than 4 bytes:
if c < 0x80 && len > 1 ||
c < 0x800 && len > 2 ||
c < 0x10000 && len > 3 ||
len > 4:
append U+FFFD to output string
Upvotes: 1
Reputation: 111890
If you save the number of bytes you'll need (so you save a second copy of the initial value of w
), you can compare the UTF32 value of the codepoint (I think you are calling it c
) with the number of bytes that were used to encode it. You know that:
U+0000 - U+007F 1 byte
U+0080 - U+07FF 2 bytes
U+0800 - U+FFFF 3 bytes
U+10000 - U+1FFFFF 4 bytes
U+200000 - U+3FFFFFF 5 bytes
U+4000000 - U+7FFFFFFF 6 bytes
(and I hope I have done the right math on the left column! Hex math isn't my strong point :-) )
Just as a sidenote: I think there are some logic errors/formatting errors. if b < 0x80 if w > 0
what happens if w = 0? (so for example if you are decoding A
)? And shouldn't you reset c
when you find an illegal codepoint?
Upvotes: 5
Reputation: 206831
Once you have the decoded character, you can tell how many bytes it should have had if properly encoded just by looking at the highest bit set.
If the highest set bit's position is <= 7, the UTF-8 encoding requires 1 octet.
If the highest set bit's position is <= 11, the UTF-8 encoding requires 2 octets.
If the highest set bit's position is <= 16, the UTF-8 encoding requires 3 octets.
etc.
If you save the original w
and compare it to these values, you'll be able to tell if the encoding was proper or overlong.
Upvotes: 1