jimmy

Reputation: 63

Binary file parsing with nom 5.0

Problem

There is a file that has multiple headers inside it, but only one of them matters to me, along with the data after it. This header repeats multiple times throughout the file.

Its magic number is A3046 in ASCII, or 0x65 0x51 0x48 0x54 0x52 in hex. After finding the header, the parser has to take all bytes until 0xff, and then repeat for the remaining headers until EOF.

My solution

First I loaded the file:

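use std::fs::OpenOptions;
use std::io::Read;

let mut file = OpenOptions::new()
    .read(true)
    .open("../assets/sample")
    .unwrap();

let mut full_file: Vec<u8> = Vec::new();
// read_to_end returns a Result; unwrap it so a failed read is not silently ignored
file.read_to_end(&mut full_file).unwrap();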

I declared the magic number with: pub static QT_MAGIC: &[u8; 5] = b"A3046"; As a test, I wrote the following function just to see whether it could find the first header.

fn parse_block(input: &[u8]) -> IResult<&[u8], &[u8]> {
    tag(QT_MAGIC)(input)
}

However, when the test runs, Ok has a None value. It definitely should have found something. What am I doing wrong?

I found no examples of byte parsing with nom 5, and being a Rust newbie is not helping. How can I parse all the blocks with these rules?

Upvotes: 4

Views: 3383

Answers (1)

Sébastien Renauld

Reputation: 19672

The nom version

First off, apologies for this one: the playground only has nom 4.0, so the code is on this GitHub repository.

To parse something like this, we're going to need to combine two different parsers:

  • take_until, to take bytes until either the preamble or the EOF marker (0xff)
  • tag, to isolate the preamble

And a combinator, preceded, so we can ditch the first element of a sequence of parsers.

// Our preamble
const MAGIC: &[u8] = &[0x65, 0x51, 0x48, 0x54, 0x52];
// Our EOF byte sequence
const EOF: &[u8] = &[0xff];

// Shorthand to catch EOF
fn match_to_eof(data: &[u8]) -> nom::IResult<&[u8], &[u8]> {
    nom::bytes::complete::take_until(EOF)(data)
}

// Shorthand to catch the preamble
fn take_until_preamble(data: &[u8]) -> nom::IResult<&[u8], &[u8]> {
    nom::bytes::complete::take_until(MAGIC)(data)
}

pub fn extract_from_data(data: &[u8]) -> Option<(&[u8], &[u8])> {
    let preamble_parser = nom::sequence::preceded(
        // Ditch anything before the preamble
        take_until_preamble,
        nom::sequence::preceded(
            // Ditch the preamble
            nom::bytes::complete::tag(MAGIC),
            // And take until the EOF (0xff)
            match_to_eof
        )
    );
    // And we swap the elements because it's confusing AF
    // as a return function
    preamble_parser(data).ok().map(|r| {
        (r.1, r.0)
    })
}

The code should be annotated well enough to follow. This ditches any bytes until it finds the preamble bytes, then ditches those and keeps everything until it finds the EOF byte sequence ([0xff]).
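Whether the preamble is ever found depends on matching the exact bytes in the file. The question gives the magic number in two ways, A3046 in ASCII and 0x65 0x51 0x48 0x54 0x52 in hex, and these are not the same byte sequence. The listed hex values resemble the decimal character codes of A, 3, 0, 4, 6 (65, 51, 48, 52, 54) written with a 0x prefix, so it may be worth checking a hex dump of the real file to see which form is actually present. A small std-only sketch to compare the two (the helper name to_hex is made up for this example):

```rust
/// Render bytes as space-separated hex, e.g. "0x41 0x33".
/// (Hypothetical helper, only for this illustration.)
fn to_hex(bytes: &[u8]) -> String {
    bytes
        .iter()
        .map(|b| format!("{:#04x}", b))
        .collect::<Vec<_>>()
        .join(" ")
}

fn main() {
    // The ASCII spelling from the question.
    let ascii_magic: &[u8] = b"A3046";
    // The hex spelling from the question, as used for MAGIC above.
    let hex_magic: &[u8] = &[0x65, 0x51, 0x48, 0x54, 0x52];

    println!("ascii: {}", to_hex(ascii_magic));
    println!("hex:   {}", to_hex(hex_magic));
    println!("same bytes: {}", ascii_magic == hex_magic);
}
```

If the file really contains the ASCII text, MAGIC would need to be b"A3046" instead of the hex list; tag and take_until compare raw bytes, so a constant with the wrong bytes never matches.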

extract_from_data then returns the nom result with its elements swapped, purely for the sake of the example. You can swap them back to combine it with other parsers if you like. The first element is the content of the sequence; the second is the remaining input, which starts at the 0xff byte because take_until does not consume the pattern it stops at. This means you can iterate with this function by feeding the second element back in (I did that in a test in the repo I put on GitHub).
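To illustrate the iteration without depending on the repository, here is a rough sketch of the loop. It is not the nom code: next_block is a plain-std stand-in that I am assuming behaves like extract_from_data (same return shape, remainder starting at the 0xff byte), and the names find, next_block, all_blocks and END are invented for this example.

```rust
const MAGIC: &[u8] = &[0x65, 0x51, 0x48, 0x54, 0x52];
const END: u8 = 0xff;

/// Position of the first occurrence of `needle` in `haystack`.
fn find(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    haystack.windows(needle.len()).position(|w| w == needle)
}

/// Std-only stand-in for `extract_from_data`: returns (content, rest),
/// where `rest` starts at the 0xff byte that ended the block.
fn next_block(data: &[u8]) -> Option<(&[u8], &[u8])> {
    // Skip everything up to and including the preamble.
    let start = find(data, MAGIC)? + MAGIC.len();
    let after_magic = &data[start..];
    // Keep everything up to (not including) the end marker.
    let end = after_magic.iter().position(|&b| b == END)?;
    Some((&after_magic[..end], &after_magic[end..]))
}

/// Collect every block by feeding the remainder back into `next_block`.
fn all_blocks(mut data: &[u8]) -> Vec<&[u8]> {
    let mut blocks = Vec::new();
    while let Some((content, rest)) = next_block(data) {
        blocks.push(content);
        data = rest;
    }
    blocks
}

fn main() {
    // Made-up sample: junk, a block, one stray byte, another block.
    let mut sample = vec![0x00u8, 0x01];
    sample.extend_from_slice(MAGIC);
    sample.extend_from_slice(&[1, 2, 3, 0xff, 0x09]);
    sample.extend_from_slice(MAGIC);
    sample.extend_from_slice(&[4, 5, 0xff]);

    for block in all_blocks(&sample) {
        println!("{:?}", block);
    }
}
```

The loop stops as soon as no further preamble or no closing 0xff is found, so a trailing block without a 0xff is dropped. The same loop shape should work with extract_from_data in place of next_block.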

Upvotes: 8
