Nivaldo T

Reputation: 103

How to properly read binary encodings from a text file in Rust?

I have a YAML file with test cases for encoding and decoding elements. The left-hand side represents the expected encoded bytes, and the right-hand side contains the original element. For example, the VarInt test cases are:

examples:
  "\0": 0
  "\u0001": 1
  "\u000A": 10
  "\u00c8\u0001": 200
  "\u00e8\u0007": 1000
  "\u00a9\u0046": 9001
  "\u00ff\u00ff\u00ff\u00ff\u00ff\u00ff\u00ff\u00ff\u00ff\u0001": -1

The encodings (left-hand side) for the first three examples work correctly when read as strings (which are automatically interpreted as UTF-8 in Rust).

However, the fourth example (200) and the subsequent ones don't yield the correct results. Using the encoding for 200 ("\u00c8\u0001") as an example:

Reading the encoding as a UTF-8 string (incorrect):

use bytes::{Buf, BufMut};

let encoding_as_utf8_string = "\u{00c8}\u{0001}";
// "È\u{1}"
println!("Encoding as a UTF-8 string: {:?}", encoding_as_utf8_string);

let mut utf8_bytes: &[u8] = encoding_as_utf8_string.as_bytes();
// [195, 136, 1] (Incorrect)
println!("Bytes obtained from the encoding when read as a UTF-8 string: {:?}", utf8_bytes);

Reading the encoding as a byte array (correct):

use bytes::{Buf, BufMut};

// 0xC8 0x01 is not valid UTF-8, so keep it as bytes rather than forcing it into a String
// (calling String::from_utf8_unchecked on invalid UTF-8 is undefined behavior).
let encoding_as_byte_array: &[u8; 2] = b"\xc8\x01";

// The lossy conversion shows a replacement character (�) in place of 0xC8
println!("Encoding string read from byte array: {:?}", String::from_utf8_lossy(encoding_as_byte_array));

let mut bytes: &[u8] = encoding_as_byte_array;
// [200, 1] (correct)
println!("Bytes obtained from the encoding when read as a byte array: {:?}", bytes);

The issue here is that when reading from the YAML file, the encodings (Mapping keys) get automatically interpreted as UTF-8 strings, so the original bytes are lost:

use serde::Deserialize;
use serde_yaml::{Deserializer, Value};

let f = std::fs::read(yaml_dir).expect("Unable to read file");
    
for doc in Deserializer::from_slice(&f) {
    let spec = Value::deserialize(doc).expect("Unable to parse document");

    // Mapping {..., "examples": Mapping {"\0": Number(0), "\u{1}": Number(1), "\n": Number(10), "È\u{1}": Number(200), "è\u{7}": Number(1000), "©F": Number(9001), "ÿÿÿÿÿÿÿÿÿ\u{1}": Number(-1)}}
    println!("YAML spec interpreted: {:?}", spec);
}

A more specific example using serde_yaml:

// Sequence [Number(200), Number(1)] (Correct, but how to make the YAML get interpreted like this?)
let bytes = serde_yaml::to_value(b"\xc8\x01").unwrap();

// String("È\u{1}") (Incorrect)
let st = serde_yaml::to_value("\u{00c8}\u{0001}").unwrap();

I'm using serde_yaml but any other approach would be acceptable. How can I make it so that the encodings in the YAML, exactly as they are written, are correctly interpreted as byte arrays instead of strings?

I know serde_yaml has methods such as deserialize_bytes, but I'm not sure how to apply them in this case.

Alternatively, is there a way to continue reading the encodings normally as UTF-8 strings and then extract the original non-UTF-8 bytes from them?

Upvotes: 0

Views: 1381

Answers (2)

Caesar

Reputation: 8484

A superficial reading of serde_yaml's code suggests that it will always convert your YAML string keys to str, so you can't get a [u8] out of them: the \u escapes are decoded as Unicode code points and stored as UTF-8, not kept as raw bytes. I suggest you change your YAML:

examples:
  [0]: 0
  [0x01]: 1
  [0x0A]: 10
  [0xc8, 0x01]: 200
  [0xe8, 0x07]: 1000
  [0xa9, 0x46]: 9001
  [0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x01]: -1

This can be parsed by serde_yaml, but alas, you said you don't want to do that.

Upvotes: 1

KamilCuk

Reputation: 140880

\u00c8 is the Unicode escape for the character È (code point U+00C8, which is also its UTF-16 code unit). That's not the byte 200. You have written the character È, not 200.

195, 136 or 0xC3 0x88 is UTF-8 for character È. This is how character È is represented as bytes in Rust.
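To make the distinction concrete, here is a small sketch using only the standard library. The helper name utf8_bytes_of is made up for illustration and is not part of the question's code. It shows that the code point of È is the number 200, while its UTF-8 encoding takes two bytes:

```rust
/// Hypothetical helper: returns the UTF-8 encoding of a single char.
fn utf8_bytes_of(c: char) -> Vec<u8> {
    // A char needs at most 4 bytes in UTF-8.
    let mut buf = [0u8; 4];
    c.encode_utf8(&mut buf).as_bytes().to_vec()
}

fn main() {
    // '\u{c8}' is the character È, i.e. Unicode code point U+00C8.
    let c = '\u{c8}';

    // The code point itself is the number 200.
    println!("code point: {}", u32::from(c));

    // UTF-8 encodes every code point above 0x7F with at least two bytes.
    println!("UTF-8 bytes: {:?}", utf8_bytes_of(c));
}
```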

If you want the UTF-16 code units of a character, you want u16, not u8. Try:

fn main() {
    let st = serde_yaml::to_value("\u{00c8}\u{0001}").unwrap();
    let v: Vec<u16> = st.as_str().unwrap().encode_utf16().collect();
    println!("{} {}", v[0], v[1]);
}
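As for the last part of the question (keep reading the keys as strings, then recover the intended bytes), one possible sketch is to map each char back to a single byte by its code point. This assumes every escape in the file is at most \u00ff, which holds for the examples shown. The function name code_points_to_bytes is hypothetical, not a serde_yaml API:

```rust
/// Hypothetical helper: turns each char into one byte using its code point.
/// Returns None if any char is above U+00FF and so cannot fit in a u8.
fn code_points_to_bytes(s: &str) -> Option<Vec<u8>> {
    s.chars()
        .map(|c| {
            let n = c as u32;
            if n <= 0xFF { Some(n as u8) } else { None }
        })
        .collect()
}

fn main() {
    // The key for 200 as it arrives after YAML parsing: "È\u{1}".
    let key = "\u{00c8}\u{0001}";
    println!("{:?}", code_points_to_bytes(key)); // Some([200, 1])

    // A char outside the one-byte range is rejected instead of being truncated.
    println!("{:?}", code_points_to_bytes("\u{20ac}")); // None
}
```

With serde_yaml, this would be applied to each key string of the examples mapping; that wiring is left out here.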

Upvotes: 2
