How can I parse this program correctly using Rust's parser library nom

Question

I want to parse a program represented by this BNF:

 :=  (* "&&" * )*
     :=  (* "&" * )*
   := ("&" *)* 
         := "a" | "b" | "c" | ...
              := " " | "	" | "
" | "
"

This BNF parses small program. log_and means Logical And.

and I wrote Rust program using parser library nom:

use nom::{
    branch::{alt, permutation},
    bytes::complete::{escaped_transform, is_not, tag, take_while_m_n},
    character::complete::{
        char, digit1, hex_digit1, line_ending, multispace0, multispace1, none_of, space0, space1,
    },
    combinator::{all_consuming, map, opt, value},
    error::VerboseError,
    multi::{many0, many1},
    number::complete::double,
    sequence::{delimited, tuple},
    IResult,
};

#[derive(Debug, PartialEq, Clone)]
pub struct LogAndExpression {
    pub and_expression: AndExpression,
    pub tail: Vec,
}
#[derive(Debug, PartialEq, Clone)]
pub struct AndExpression {
    pub unary_expression: UnaryExpression,
    pub tail: Vec,
}
#[derive(Debug, PartialEq, Clone)]
pub struct UnaryExpression {
    pub unary_operators: Vec,
    pub identifier: String,
}

pub fn log_and_expression(s: &str) -> IResult<&str, LogAndExpression> {
    map(
        tuple((
            and_expression,
            many0(
                map(
                    tuple((
                        multispace0,
                        tag("&&"),
                        multispace0,
                        and_expression,
                    )),
                    |(_, _, _, expression)| expression,
                )
            ),
        )),
        |(expression, vec)| LogAndExpression {
            and_expression: expression,
            tail: vec,
        },
    )(s)
}

pub fn and_expression(s: &str) -> IResult<&str, AndExpression> {
    map(
        tuple((
            unary_expression,
            many0(
                map(
                    tuple((
                        multispace0,
                        tag("&"),
                        multispace0,
                        unary_expression,
                    )),
                    |(_, _, _, expression)| expression,
                )
            ),
        )),
        |(expression, vec)| AndExpression {
            unary_expression: expression,
            tail: vec,
        },
    )(s)
}

pub fn unary_expression(s: &str) -> IResult<&str, UnaryExpression> {
    map(
        tuple((
            many0(
                map(
                    tuple((
                        tag("&"),
                        multispace0
                    )),
                    |(opr, _): (&str, &str)| opr.to_owned(),
                )
            ),
            is_not(" &"),
        )),
        |(unary_operators, identifier)| UnaryExpression {
            unary_operators: unary_operators,
            identifier: identifier.to_owned(),
        },
    )(s)
}

fn main() {
    println!("{:?}", log_and_expression("a && b"));
}

This code outputs the result of parse a && b. log_and_expression parses program.

The result I want is a LogAnd b. But result is:

Ok(("", LogAndExpression { and_expression: AndExpression { unary_expression: UnaryExpression { unary_operators: [], identifier: "a" }, tail: [UnaryExpression { unary_operators: ["&"], identifier: "b" }] }, tail: [] }))

this means a And &b.

How can I parse this program correctly using nom?

transistor · Accepted Answer

The BNF rules as listed are ambiguous, since there's no easy way to distinguish between the correct way to parse a&&b. It could be parsed as a && b or a & &b, and since the latter option is the first thing the nom parser version will attempt to parse, it will return that option first.

In the log_and_expression function, the parser will call the and_expression function which will correctly match the a identifier followed by the first &, and then it will call unary_expression function to parse the remainder, which in this case can be correctly parsed as the unary & b expression, since that rule explicitly allows spaces between & and b. This gives you the output you get (which isn't what you want)

One way to fix this is to rewrite the parser so that it matches the smallest unary expression, followed by many0 of either a && follow by another unary expression, or & followed by a unary expression. Something like

 =  (*  *)*
 = ("&&" * ) | ("&" * )

If it tries to match && first, then that will be the favored way of parsing a&&b, and the alternative form can only be parsed if it's written with a space as in a& &b. This might require building a list of pairs of "operator" and "unary expression", which would then need to be reassembled into the correct tree-like format you're looking for. An advantage of doing this as a 2 stage process is that you can then correctly apply order of operations when supporting multiple operators with different precedence.

There's a simple example of this kind of approach with nom here

How can I parse this program correctly using Rust's parser library nom

Answers (1)

Related Questions

How can I parse this program correctly using Rust&#39;s parser library nom

Answers (1)

Related Questions

How can I parse this program correctly using Rust's parser library nom