Reputation: 103
I have a YAML file with test cases for encoding and decoding elements. The left-hand side represents the expected encoded bytes, and the right-hand side contains the original element. For example, the VarInt test cases are:
examples:
"\0": 0
"\u0001": 1
"\u000A": 10
"\u00c8\u0001": 200
"\u00e8\u0007": 1000
"\u00a9\u0046": 9001
"\u00ff\u00ff\u00ff\u00ff\u00ff\u00ff\u00ff\u00ff\u00ff\u0001": -1
The encodings (left-hand side) for the first three examples work correctly when read as strings (which are automatically interpreted as UTF-8 in Rust).
However, the fourth example (200) and the subsequent ones don't yield the correct results. Using the encoding for 200 ("\u00c8\u0001") as an example:
Reading the encoding as a UTF-8 string (incorrect):
let encoding_as_utf8_string = "\u{00c8}\u{0001}";
// "È\u{1}"
println!("Encoding as a UTF-8 string: {:?}", encoding_as_utf8_string);
let utf8_bytes: &[u8] = encoding_as_utf8_string.as_bytes();
// [195, 136, 1] (incorrect: the single byte 0xC8 has been re-encoded as the two UTF-8 bytes 195, 136)
println!("Bytes obtained from the encoding when read as a UTF-8 string: {:?}", utf8_bytes);
Reading the encoding as a byte array (correct):
let string_from_byte_array: String;
unsafe {
    let encoding_as_byte_array: &[u8; 2] = b"\xc8\x01";
    // Note: [0xC8, 0x01] is not valid UTF-8, so from_utf8_unchecked here is
    // technically undefined behaviour; it is only used to show that the raw
    // bytes are preserved.
    string_from_byte_array = String::from_utf8_unchecked(encoding_as_byte_array.to_vec());
}
// "�"
println!("Encoding string read from byte array: {:?}", string_from_byte_array);
let bytes: &[u8] = string_from_byte_array.as_bytes();
// [200, 1] (correct)
println!("Bytes obtained from the encoding when read as a byte array: {:?}", bytes);
The issue here is that when reading from the YAML file, the encodings (Mapping keys) get automatically interpreted as UTF-8 strings, so the original bytes are lost:
use serde::Deserialize;
use serde_yaml::{Deserializer, Value};
let f = std::fs::read(yaml_dir).expect("Unable to read file");
for doc in Deserializer::from_slice(&f) {
    let spec = Value::deserialize(doc).expect("Unable to parse document");
    // Mapping {..., "examples": Mapping {"\0": Number(0), "\u{1}": Number(1), "\n": Number(10), "È\u{1}": Number(200), "è\u{7}": Number(1000), "©F": Number(9001), "ÿÿÿÿÿÿÿÿÿ\u{1}": Number(-1)}}
    println!("YAML spec interpreted: {:?}", spec);
}
A more specific example using serde_yaml:
// Sequence [Number(200), Number(1)] (correct, but how do I get the YAML keys interpreted like this?)
let bytes = serde_yaml::to_value(b"\xc8\x01").unwrap();
// String("È\u{1}") (incorrect)
let st = serde_yaml::to_value("\u{00c8}\u{0001}").unwrap();
I'm using serde_yaml but any other approach would be acceptable. How can I make it so that the encodings in the YAML, exactly as they are written, are correctly interpreted as byte arrays instead of strings?
I know serde_yaml has methods such as deserialize_bytes, but I'm not sure how to apply them in this case.
Alternatively, is there a way to continue reading the encodings normally as UTF-8 strings and then extract the original non-UTF-8 bytes from them?
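To illustrate that second idea: since every escape in the file has the form \u00XX, every character of the parsed key should have a code point that fits in one byte, so I imagine something like this rough sketch (key_to_bytes is a hypothetical helper name, just to show what I mean):
// Hypothetical helper: recover the raw bytes from a key whose characters were
// all written as \u00XX escapes, i.e. every char's code point fits in one byte.
// Returns None if any character falls outside 0x00..=0xFF.
fn key_to_bytes(key: &str) -> Option<Vec<u8>> {
    key.chars().map(|c| u8::try_from(u32::from(c)).ok()).collect()
}

fn main() {
    // The key for 200, as serde_yaml parses it: "È\u{1}".
    assert_eq!(key_to_bytes("\u{00c8}\u{0001}"), Some(vec![0xC8, 0x01]));
}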
Upvotes: 0
Views: 1381
Reputation: 8484
A superficial reading of serde_yaml's code suggests that it will always try to convert your YAML string keys to str (which must fail, since they're not valid UTF-8), and you can't get a [u8] out of them. I suggest you change your YAML:
examples:
  [0]: 0
  [0x01]: 1
  [0x0A]: 10
  [0xc8, 0x01]: 200
  [0xe8, 0x07]: 1000
  [0xa9, 0x46]: 9001
  [0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x01]: -1
This can be parsed by serde_yaml, but alas, you said you don't want to do that.
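For completeness, a rough sketch of how the reworked YAML could then be consumed (an inline string stands in for your file here, and I'm assuming the parser resolves the 0x... literals as integers):
use serde_yaml::Value;

fn main() {
    let yaml = "
examples:
  [0xc8, 0x01]: 200
  [0xe8, 0x07]: 1000
";
    let spec: Value = serde_yaml::from_str(yaml).expect("Unable to parse document");
    if let Some(Value::Mapping(examples)) = spec.get("examples") {
        for (key, value) in examples {
            // Each key is now a YAML sequence of integers, so the raw bytes survive.
            let bytes: Vec<u8> = key
                .as_sequence()
                .expect("key should be a sequence")
                .iter()
                .map(|n| n.as_u64().expect("byte literal") as u8)
                .collect();
            // [200, 1] -> 200, [232, 7] -> 1000
            println!("{:?} -> {}", bytes, value.as_i64().unwrap());
        }
    }
}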
Upvotes: 1
Reputation: 140880
\u00c8 is the UTF-16 escape for the character È. That's not 200; that's the character È. You have written the character È, not the byte 200.
195, 136 (or 0xC3 0x88) is the UTF-8 encoding of the character È. This is how the character È is represented as bytes in Rust.
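You can see this directly in Rust:
fn main() {
    // The char È is U+00C8; as UTF-8 it becomes the two bytes 0xC3, 0x88 (195, 136).
    assert_eq!("\u{00c8}".as_bytes(), &[0xC3, 0x88]);
    println!("{:?}", "\u{00c8}".as_bytes()); // [195, 136]
}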
If you want to print the UTF-16 representation of a character, you want to print u16 values, not u8. Try:
fn main() {
    let st = serde_yaml::to_value("\u{00c8}\u{0001}").unwrap();
    let v: Vec<u16> = st.as_str().unwrap().encode_utf16().collect();
    // Prints: 200 1
    println!("{} {}", v[0], v[1]);
}
Upvotes: 2