Andrei Sibircevs
Andrei Sibircevs

Reputation: 157

Scala. Difficult expression parser. OutOfMemoryError

I would like to create a parser for difficult expression with order of operations. I have some example but it works very slowly and throw exception OutOfMemoryError. How can I improve it?

def expr: Parser[Expression] = term5
def term5: Parser[Expression] =
    (term4 ~ "OR" ~ term4) ^^ { case lhs~o~rhs => BinaryOp("OR", lhs, rhs) } |
    term4
def term4: Parser[Expression] =
    (term3 ~ "AND" ~ term3) ^^ { case lhs~a~rhs => BinaryOp("AND", lhs, rhs) } |
    term3
def term3: Parser[Expression] =
    (term2 ~ "<>" ~ term2) ^^ { case lhs~ne~rhs => BinaryOp("NE", lhs, rhs) } |
    (term2 ~ "=" ~ term2) ^^ { case lhs~eq~rhs => BinaryOp("EQ", lhs, rhs) } |
    (term2 ~ "NE" ~ term2) ^^ { case lhs~ne~rhs => BinaryOp("NE", lhs, rhs) } |
    (term2 ~ "EQ" ~ term2) ^^ { case lhs~eq~rhs => BinaryOp("EQ", lhs, rhs) } |
    term2
def term2: Parser[Expression] =
    (term1 ~ "<" ~ term1) ^^ { case lhs~lt~rhs => BinaryOp("LT", lhs, rhs) } |
    (term1 ~ ">" ~ term1) ^^ { case lhs~gt~rhs => BinaryOp("GT", lhs, rhs) } |
    (term1 ~ "<=" ~ term1) ^^ { case lhs~le~rhs => BinaryOp("LE", lhs, rhs) } |
    (term1 ~ ">=" ~ term1) ^^ { case lhs~ge~rhs => BinaryOp("GE", lhs, rhs) } |
    (term1 ~ "LT" ~ term1) ^^ { case lhs~lt~rhs => BinaryOp("LT", lhs, rhs) } |
    (term1 ~ "GT" ~ term1) ^^ { case lhs~gt~rhs => BinaryOp("GT", lhs, rhs) } |
    (term1 ~ "LE" ~ term1) ^^ { case lhs~le~rhs => BinaryOp("LE", lhs, rhs) } |
    (term1 ~ "GE" ~ term1) ^^ { case lhs~ge~rhs => BinaryOp("GE", lhs, rhs) } |
    term1
def term1: Parser[Expression] =
    (term ~ "+" ~ term) ^^ { case lhs~plus~rhs => BinaryOp("+", lhs, rhs) } |
    (term ~ "-" ~ term) ^^ { case lhs~minus~rhs => BinaryOp("-", lhs, rhs) } |
    (term ~ ":" ~ term) ^^ { case lhs~concat~rhs => BinaryOp(":", lhs, rhs) } |
    term 
def term: Parser[Expression] =
    (factor ~ "*" ~ factor) ^^ { case lhs~times~rhs => BinaryOp("*", lhs, rhs) } |
    (factor ~ "/" ~ factor) ^^ { case lhs~div~rhs => BinaryOp("/", lhs, rhs) } |
    (factor ~ "MOD" ~ factor) ^^ { case lhs~div~rhs => BinaryOp("MOD", lhs, rhs) } |
    factor
def factor: Parser[Expression] =
    "(" ~> expr <~ ")" |
    ("+" | "-") ~ factor ^^ { case op~rhs => UnaryOp(op, rhs) } |
    function |
    numericLit ^^ { case x => Number(x/*.toFloat*/) } |
    stringLit ^^ { case s => Literal(s) } |
    ident ^^ { case id => Variable(id) }

Upvotes: 1

Views: 242

Answers (2)

Andrei Sibircevs
Andrei Sibircevs

Reputation: 157

My first improvement was like that:

def term3: Parser[Expression] =
log((term2 ~ ("<>" | "=" | "NE" | "EQ") ~ term2) ^^ { case lhs~op~rhs => BinaryOp(op, lhs, rhs) })("term3 <>,=,NE,EQ") |
log(term2)("term3 term2")

It works without OutOfMemoryError but to slow. After viewing doc/scala-devel-docs/examples/parsing/lambda/TestParser.scala I got this source:

def expr: Parser[Expression] = term5
def term5: Parser[Expression] =
    log(chainl1(term4, term5, "OR" ^^ {o => (a: Expression, b: Expression) => BinaryOp(o, a, b)}))("term5 OR")
def term4: Parser[Expression] =
    log(chainl1(term3, term4, "AND" ^^ {o => (a: Expression, b: Expression) => BinaryOp(o, a, b)}))("term4 AND")
def term3: Parser[Expression] =
    log(chainl1(term2, term3, ("<>" | "=" | "NE" | "EQ") ^^ {o => (a: Expression, b: Expression) => BinaryOp(o, a, b)}))("term3 <>,=,NE,EQ")
def term2: Parser[Expression] =
    log(chainl1(term1, term2, ("<" | ">" | "<=" | ">=" | "LT" | "GT" | "LE" | "GE") ^^ {o => (a: Expression, b: Expression) => BinaryOp(o, a, b)}))("term2 <,>,...")
def term1: Parser[Expression] =
    log(chainl1(term, term1, ("+" | "-" | ":") ^^ {o => (a: Expression, b: Expression) => BinaryOp(o, a, b)}))("term1 +,-,:")
def term: Parser[Expression] =
    log(chainl1(factor, term, ("*" | "/" | "MOD") ^^ {o => (a: Expression, b: Expression) => BinaryOp(o, a, b)}))("term *,/,MOD")
def factor: Parser[Expression] =
    log("(" ~> expr <~ ")")("factor ()") |
    log(("+" | "-") ~ factor ^^ { case op~rhs => UnaryOp(op, rhs) })("factor unary") |
    log(function)("factor function") |
    log(numericLit ^^ { case x => Number(x/*.toFloat*/) })("factor numLit") |
    log(stringLit ^^ { case s => Literal(s) })("factor strLit") |
    log(ident ^^ { case id => Variable(id) })("factor ident")

It works fast. I'm sorry but I can not understand how chainl1 function improve my source. I don't understand how it works.

Upvotes: 0

Daniel C. Sobral
Daniel C. Sobral

Reputation: 297265

Basically, it's slow and consumes too much memory because your grammar it is incredibly inefficient.

Let's consider the second line: B = A:(1+2). It will try to parse this line like this:

  1. term4 OR term4 and then term4.
  2. term3 AND term3 and then term3.
  3. term2 <> term2, then =, then NE then EQ and then term2.
  4. term1 8 different operators term1, then term1.
  5. term + term, term - term, term : term and then term.
  6. factor * factor, factor / factor, factor MOD factor and then factor.
  7. parenthesis expression, unary factor, function, numeric literal, string literal, ident.

The first try is like this:

ident * factor + term < term1 <> term2 AND term3 OR term4

I'm skipping parenthesis, unary, function, numeric and string literals because they don't match A -- though function probably does, but it's definition isn't available. Now, since : doesn't match *, it will try next:

ident / factor + term < term1 <> term2 AND term3 OR term4
ident MOD factor + term < term1 <> term2 AND term3 OR term4
ident + term < term1 <> term2 AND term3 OR term4

Now it goes to the next term1:

ident * factor - term < term1 <> term2 AND term3 OR term4
ident / factor - term < term1 <> term2 AND term3 OR term4
ident MOD factor - term < term1 <> term2 AND term3 OR term4
ident - term < term1 <> term2 AND term3 OR term4

And next:

ident * factor : term < term1 <> term2 AND term3 OR term4
ident / factor : term < term1 <> term2 AND term3 OR term4
ident MOD factor : term < term1 <> term2 AND term3 OR term4
ident : term < term1 <> term2 AND term3 OR term4

Aha! We finally got a match on term1! But ( doesn't match <, so it has to try the next term2:

ident * factor + term > term1 <> term2 AND term3 OR term4

etc...

All because the first term in each line for each term will always match! To match a simple number it has to parse factor 2 * 2 * 5 * 9 * 4 * 4 = 2880 times!

But that's not half of the story! You see, because termX is repeated twice, it will repeat all this stuff on both sides. For example, the first match for A:(1+2) is this:

ident : term < term1 <> term2 AND term3 OR term4
where ident = A
and   term  = (1+2)

Which is incorrect, so it will try > instead of <, and then <=, etc, etc.

I'm putting a logging version of this parser below. Try to run it and see all the things it is trying to parse.

Meanwhile, there are good examples of how to write these parsers available. Using sbaz, try:

sbaz install scala-devel-docs

Then look inside the doc/scala-devel-docs/examples/parsing directory of the Scala distribution and you'll find several examples.

Here's a version of your parser (without function) that logs everything it tries:

sealed trait Expression
case class Variable(id: String) extends Expression
case class Literal(s: String) extends Expression
case class Number(x: String) extends Expression
case class UnaryOp(op: String, rhs: Expression) extends Expression
case class BinaryOp(op: String, lhs: Expression, rhs: Expression) extends Expression

object TestParser extends scala.util.parsing.combinator.syntactical.StdTokenParsers {
    import scala.util.parsing.combinator.lexical.StdLexical
    type Tokens = StdLexical
    val lexical = new StdLexical
    lexical.delimiters ++= List("(", ")", "+", "-", "*", "/", "=", "OR", "AND", "NE", "EQ", "LT", "GT", "LE", "GE", ":", "MOD")
    def stmts: Parser[Any] = log(expr.*)("stmts")
    def stmt: Parser[Expression] = log(expr <~ "\n")("stmt")
    def expr: Parser[Expression] = log(term5)("expr")
    def term5: Parser[Expression] = (
        log((term4 ~ "OR" ~ term4) ^^ { case lhs~o~rhs => BinaryOp("OR", lhs, rhs) })("term5 OR")
      | log(term4)("term5 term4")
    )
    def term4: Parser[Expression] = (
        log((term3 ~ "AND" ~ term3) ^^ { case lhs~a~rhs => BinaryOp("AND", lhs, rhs) })("term4 AND")
      | log(term3)("term4 term3")
    )
    def term3: Parser[Expression] = (
        log((term2 ~ "<>" ~ term2) ^^ { case lhs~ne~rhs => BinaryOp("NE", lhs, rhs) })("term3 <>")
      | log((term2 ~ "=" ~ term2) ^^ { case lhs~eq~rhs => BinaryOp("EQ", lhs, rhs) })("term3 =")
      | log((term2 ~ "NE" ~ term2) ^^ { case lhs~ne~rhs => BinaryOp("NE", lhs, rhs) })("term3 NE")
      | log((term2 ~ "EQ" ~ term2) ^^ { case lhs~eq~rhs => BinaryOp("EQ", lhs, rhs) })("term3 EQ")
      | log(term2)("term3 term2")
    )
    def term2: Parser[Expression] = (
        log((term1 ~ "<" ~ term1) ^^ { case lhs~lt~rhs => BinaryOp("LT", lhs, rhs) })("term2 <")
      | log((term1 ~ ">" ~ term1) ^^ { case lhs~gt~rhs => BinaryOp("GT", lhs, rhs) })("term2 >")
      | log((term1 ~ "<=" ~ term1) ^^ { case lhs~le~rhs => BinaryOp("LE", lhs, rhs) })("term2 <=")
      | log((term1 ~ ">=" ~ term1) ^^ { case lhs~ge~rhs => BinaryOp("GE", lhs, rhs) })("term2 >=")
      | log((term1 ~ "LT" ~ term1) ^^ { case lhs~lt~rhs => BinaryOp("LT", lhs, rhs) })("term2 LT")
      | log((term1 ~ "GT" ~ term1) ^^ { case lhs~gt~rhs => BinaryOp("GT", lhs, rhs) })("term2 GT")
      | log((term1 ~ "LE" ~ term1) ^^ { case lhs~le~rhs => BinaryOp("LE", lhs, rhs) })("term2 LE")
      | log((term1 ~ "GE" ~ term1) ^^ { case lhs~ge~rhs => BinaryOp("GE", lhs, rhs) })("term2 GE")
      | log(term1)("term2 term1")
    )
    def term1: Parser[Expression] = (
        log((term ~ "+" ~ term) ^^ { case lhs~plus~rhs => BinaryOp("+", lhs, rhs) })("term1 +")
      | log((term ~ "-" ~ term) ^^ { case lhs~minus~rhs => BinaryOp("-", lhs, rhs) })("term1 -")
      | log((term ~ ":" ~ term) ^^ { case lhs~concat~rhs => BinaryOp(":", lhs, rhs) })("term1 :")
      | log(term)("term1 term")
    )
    def term: Parser[Expression] = (
        log((factor ~ "*" ~ factor) ^^ { case lhs~times~rhs => BinaryOp("*", lhs, rhs) })("term *")
      | log((factor ~ "/" ~ factor) ^^ { case lhs~div~rhs => BinaryOp("/", lhs, rhs) })("term /")
      | log((factor ~ "MOD" ~ factor) ^^ { case lhs~div~rhs => BinaryOp("MOD", lhs, rhs) })("term MOD")
      | log(factor)("term factor")
    )
    def factor: Parser[Expression] = (
        log("(" ~> expr <~ ")")("factor (expr)")
      | log(("+" | "-") ~ factor ^^ { case op~rhs => UnaryOp(op, rhs) })("factor +-")
      //| function |
      | log(numericLit ^^ { case x => Number(x/*.toFloat*/) })("factor numericLit")
      | log(stringLit ^^ { case s => Literal(s) })("factor stringLit")
      | log(ident ^^ { case id => Variable(id) })("factor ident")
    )
    def parse(s: String) = stmts(new lexical.Scanner(s))
}

Upvotes: 4

Related Questions