Elektito
Elektito

Reputation: 4155

Collision in between "END" and "END IF" in BASIC grammar, using Lark

I'm trying to create an LALR parser for BASIC using Lark, and I'm having a hard time fixing a collision between the "END" statement and statements like "END IF". Here's a simplified version of the grammar:

%ignore /[ \t\f]+/

program: _nlopt _part_list

_part_list: (stmt | block) _nl _part_list
          |

_nlopt: _nl
      |

_nl: _NEWLINE _nl
   | _NEWLINE

block: if_block

stmt: print_stmt
    | end_stmt

end_stmt: END_KW

if_block: IF_KW expr THEN_KW _nl block_body endif_stmt

endif_stmt: END_KW IF_KW

block_body: _block_body_item block_body
          |

_block_body_item: stmt _nl

print_stmt: PRINT_KW expr

?expr: NUMERIC_LITERAL

_NEWLINE: "\n"
NUMERIC_LITERAL: /[\-+]?\d+(\.\d*)?[!#%&]?/

END_KW: "end"i
IF_KW: "if"i
PRINT_KW: "print"i
THEN_KW: "then"i

If I try this grammar with a code like this:

parser = Lark(grammar, start='program', parser='lalr')

prog = r"""
if 1 then
   print 200
end if
"""
t = parser.parse(prog)
print(t.pretty())

This is what I get from Lark:

Traceback (most recent call last):
  File "test.py", line 230, in <module>
    t = parser.parse(prog)
  File "/home/user/.pyenv/versions/venv/lib/python3.6/site-packages/lark/lark.py", line 250, in parse
    return self.parser.parse(text)
  File "/home/user/.pyenv/versions/venv/lib/python3.6/site-packages/lark/parser_frontends.py", line 37, in parse
    return self.parser.parse(token_stream, *[sps] if sps is not NotImplemented else [])
  File "/home/user/.pyenv/versions/venv/lib/python3.6/site-packages/lark/parsers/lalr_parser.py", line 68, in parse
    for token in stream:
  File "/home/user/.pyenv/versions/venv/lib/python3.6/site-packages/lark/lexer.py", line 341, in lex
    for x in l.lex(stream, self.root_lexer.newline_types, self.root_lexer.ignore_types):
  File "/home/user/.pyenv/versions/venv/lib/python3.6/site-packages/lark/lexer.py", line 175, in lex
    raise UnexpectedCharacters(stream, line_ctr.char_pos, line_ctr.line, line_ctr.column, allowed=allowed, state=self.state)
lark.exceptions.UnexpectedCharacters: No terminal defined for 'i' at line 4 col 5

end if
    ^

Expecting: ['__IGNORE_0', '_NEWLINE']

If I remove the "end_stmt" rule, this will not happen. Is there a way to fix the grammar so that this does not happen?

Upvotes: 1

Views: 893

Answers (2)

user1524750
user1524750

Reputation:

I had that same conflict with my basic grammar. Basic language is LALR(2) or LR(2) because of the END WHILE, END IF, etc. If you have an LR(2) parser generator you can parse basic. LRSTAR parser generator can create LR(2) parsers.

Upvotes: 1

sepp2k
sepp2k

Reputation: 370425

By default Lark does not warn you about shift-reduce conflicts in the grammar and instead silently resolves them in favor of shifting. Often this leads to a parser that does not parse what you want it to - as is the case here. You can make lark warn you about conflicts like these by passing the debug = True flag to Lark(). That way you'll see that something's wrong even before finding the problem through tests and you might even get helpful information as to where the problem lies.

With the debug option enabled, you'll get a warning that's there's a shift-reduce conflict where an END_KW could either mean that the current block_body is over or it could be an end_stmt. This is a problem because a LALR(1) parser can only look one token ahead, but we'd have to look a second token ahead to see whether there's an if after the end to correctly decide which option to take.

You can fix this, somewhat hackishly, by turning end if into a single token like this:

ENDIF_KW: /end[ \t\f]+if/i

And then using ENDIF_KW instead of END_KW IF_KW.

PS: Note that your grammar will work fine without these changes if you use Earley parsing instead of LALR(1).

Upvotes: 3

Related Questions