炸鱼薯条德里克

Reputation: 999

What does the index of a UTF-8 encoding error indicate?

fn main() {
    // U+D7FF is the last code point before the surrogate range, so these bytes are valid UTF-8.
    match String::from_utf8(vec![0xed, 0x9f, 0xbf]) {
        Ok(s) => println!("U+D7FF OK! Get {}", s),
        Err(_) => println!("U+D7FF Fail!"),
    }

    // U+D800 is a surrogate, which UTF-8 does not allow, so decoding fails here.
    match String::from_utf8(vec![0xed, 0xa0, 0x80]) {
        Ok(s) => println!("U+D800 OK! Get {}", s),
        Err(e) => println!("{}", e),
    }
}

For the second case, this code prints "invalid utf-8 sequence of 1 bytes from index 0". I understand that it's an encoding error, but why does the error say index 0? Shouldn't it be index 1, since the byte at index 0 is the same in both cases?

Upvotes: 4

Views: 2764

Answers (1)

DK.

Reputation: 58975

That's because Rust reports the index of the byte that begins the invalid sequence, not the specific byte within that sequence that is wrong. After all, the fault could be in the second byte, the first byte could have been corrupted, or the real leading byte of the sequence could have gone missing.

Rust doesn't, and can't, know, so it just reports the most convenient position: the first offset at which it couldn't decode a complete code point.
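As a rough illustration, the index in the message appears to correspond to Utf8Error::valid_up_to(), i.e. how many leading bytes decoded cleanly, while the "of 1 bytes" part corresponds to Utf8Error::error_len(). The sketch below uses std::str::from_utf8 instead of String::from_utf8 to keep it short; the describe helper is a hypothetical name introduced here, not part of the standard library.

```rust
use std::str;

/// Hypothetical helper for illustration: returns `None` for valid UTF-8,
/// otherwise `(valid_up_to, error_len)` taken from the `Utf8Error`.
fn describe(bytes: &[u8]) -> Option<(usize, Option<usize>)> {
    str::from_utf8(bytes)
        .err()
        .map(|e| (e.valid_up_to(), e.error_len()))
}

fn main() {
    // U+D7FF: valid, so there is no error to describe.
    println!("{:?}", describe(&[0xed, 0x9f, 0xbf]));

    // U+D800 (a surrogate): nothing before it decoded, so valid_up_to is 0.
    println!("{:?}", describe(&[0xed, 0xa0, 0x80]));

    // The same bad bytes after two valid ASCII bytes: valid_up_to moves to 2.
    println!("{:?}", describe(b"ab\xed\xa0\x80"));

    // A truncated sequence (first two bytes of a three-byte character):
    // error_len is None because the input ended before the sequence did.
    println!("{:?}", describe(&[0xe2, 0x82]));
}
```

In other words, the reported index marks where the valid prefix ends, which is why prepending valid bytes shifts it while the position of the "bad" byte inside the sequence does not.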

Upvotes: 6
