Woody1193
Woody1193

Reputation: 7960

Pest doesn't parse a recursive grammar as I would expect

I'm using the pest crate to implement a recursive grammar in Rust:

id = _{ ASCII_ALPHA_LOWER ~ (ASCII_ALPHANUMERIC|"_")* }
integer = _{ (ASCII_NONZERO_DIGIT ~ ASCII_DIGIT*)|"0" }
real = _{ ((integer ~ "." ~ ASCII_DIGIT*) | (integer? ~ "." ~ ASCII_DIGIT+)) ~ (("e"|"E") ~ ("-"|"+")? ~ ASCII_DIGIT+)? }

unaryop = _{ "sin"|"cos"|"tan"|"exp"|"ln"|"sqrt" }

inner_exp = _{ real|integer|"pi"|id }

exp = { SOI ~ ( inner_exp | (exp ~ ( "+"|"-"|"*"|"/"|"^" ) ~ inner_exp) | ("-" ~ exp) | ("(" ~ exp ~ ")") | (unaryop ~ "(" ~ exp ~ ")") ) ~ EOI }

However, I'm finding that pest isn't parsing the grammar as I would expect. For instance, 2+3 gives me an error of:

 --> 1:2
  |
1 | 2+3
  |  ^---
  |
  = expected EOI

It appears that the inner_exp choice is being parsed and then, when the + symbol is encountered, the parser doesn't know what to do. I'm pretty sure there's a problem with how I've written the exp ~ ( "+"|"-"|"*"|"/"|"^" ) ~ inner_exp choice but I'm not sure what exactly is causing the problem. If I replace that choice with exp ~ ( "+"|"-"|"*"|"/"|"^" ) ~ exp I get an error stating that the expression is left-recursive. How do I fix this grammar?

Upvotes: 2

Views: 2153

Answers (1)

sepp2k
sepp2k

Reputation: 370082

The choice operator in PEGs is ordered and works as follows: Given e = {alt1 | alt2}:

  • If alt1 can be matched successfully, alt1 is applied and alt2 is never tried.
  • Otherwise alt2 is matched
  • If alt2 can't match either, e fails to match

Now e = {e1 ~ e2} works as follows:

  • If e1 can be matched and e2 can be matched after it, both are matched sequentially.
  • Otherwise e fails to match.

So if you have something like e = {(e1 | e2) ~ e3}, the following will happen:

  • If e1 can be matched:
    • If e3 can be matched after e1, both are matched sequentially
    • Otherwise, e fails to match
  • If e1 fails to match, but e2 can be matched:
    • If e3 can be matched after e2, both are matched sequentially
    • Otherwise, e fails to match

Notably if e1 succeeds and e3 fails, it does not go back and try to match e2 instead. So if both e1 and e2 can produce a match, but only e2 allows e3 to be matched afterwards, (e1 | e2) ~ e3 will fail whereas (e1 ~ e3) | (e2 ~ e3) would succeed.

So in your grammar you have (inner_exp | ...) ~ EOI. Now for your input inner_exp produces a match, so per the above rules the other alternatives are never tried and it tries to match EOI next. EOI doesn't match, so the whole rule fails and you get the syntax error you get.

This explains the syntax error, but it's not the only problem your grammar has:

Your exp rule is recursive, but it's anchored via SOI and EOI, so it can never match anything other than the entire input. This means that the recursive calls will necessarily always fail. To fix this you should remove SOI and EOI from the definition of exp and instead have a main rule like start = {SOI ~ exp ~ EOI}.

Once you've done this, you'll get an error that your exp rule is now left-recursive, which pest does not support. To fix that, you can replace the left recursion with repetition like this (replacing both the inner_exp and exp ~ (...) ~ inner_exp alternatives) where operand is a rule that matches the constructs other than infix operations:

operand ~ (( "+"|"-"|"*"|"/"|"^") ~ operand)*

Incidentally this will also fix your current issue because you now no longer have an inner_exp alternative that's tried before the alternative for infix expressions.

Your last issue is that you're not taking operator precedence into account at all. You can fix that by introducing additional "levels" of expressions in addition to inner_exp and exp, so that only operators that have the same precedence are defined in the same rule and then each rule invokes the rule containing the next higher precedence to parse the operands. That would look like this:

exp = { summand ~ (("+" | "-") ~ summand)* }
summand = { factor ~ (("*" | "/" | "%") ~ factor)* }
factor = { unary ~ ("^" ~ unary)* }
unary = { "-" ~ unary | unaryop ~ "(" ~ exp ~ ")" | primary }
primary = { "(" ~ exp ~ ")" | real | integer | "pi" | id }

Upvotes: 6

Related Questions