Reputation: 4155
I'm trying to create an LALR parser for BASIC using Lark, and I'm having a hard time fixing a collision between the "END" statement and statements like "END IF". Here's a simplified version of the grammar:
%ignore /[ \t\f]+/
program: _nlopt _part_list
_part_list: (stmt | block) _nl _part_list
|
_nlopt: _nl
|
_nl: _NEWLINE _nl
| _NEWLINE
block: if_block
stmt: print_stmt
| end_stmt
end_stmt: END_KW
if_block: IF_KW expr THEN_KW _nl block_body endif_stmt
endif_stmt: END_KW IF_KW
block_body: _block_body_item block_body
|
_block_body_item: stmt _nl
print_stmt: PRINT_KW expr
?expr: NUMERIC_LITERAL
_NEWLINE: "\n"
NUMERIC_LITERAL: /[\-+]?\d+(\.\d*)?[!#%&]?/
END_KW: "end"i
IF_KW: "if"i
PRINT_KW: "print"i
THEN_KW: "then"i
If I try this grammar with a code like this:
parser = Lark(grammar, start='program', parser='lalr')
prog = r"""
if 1 then
print 200
end if
"""
t = parser.parse(prog)
print(t.pretty())
This is what I get from Lark:
Traceback (most recent call last):
File "test.py", line 230, in <module>
t = parser.parse(prog)
File "/home/user/.pyenv/versions/venv/lib/python3.6/site-packages/lark/lark.py", line 250, in parse
return self.parser.parse(text)
File "/home/user/.pyenv/versions/venv/lib/python3.6/site-packages/lark/parser_frontends.py", line 37, in parse
return self.parser.parse(token_stream, *[sps] if sps is not NotImplemented else [])
File "/home/user/.pyenv/versions/venv/lib/python3.6/site-packages/lark/parsers/lalr_parser.py", line 68, in parse
for token in stream:
File "/home/user/.pyenv/versions/venv/lib/python3.6/site-packages/lark/lexer.py", line 341, in lex
for x in l.lex(stream, self.root_lexer.newline_types, self.root_lexer.ignore_types):
File "/home/user/.pyenv/versions/venv/lib/python3.6/site-packages/lark/lexer.py", line 175, in lex
raise UnexpectedCharacters(stream, line_ctr.char_pos, line_ctr.line, line_ctr.column, allowed=allowed, state=self.state)
lark.exceptions.UnexpectedCharacters: No terminal defined for 'i' at line 4 col 5
end if
^
Expecting: ['__IGNORE_0', '_NEWLINE']
If I remove the "end_stmt" rule, this will not happen. Is there a way to fix the grammar so that this does not happen?
Upvotes: 1
Views: 893
Reputation:
I had that same conflict with my basic grammar. Basic language is LALR(2) or LR(2) because of the END WHILE, END IF, etc. If you have an LR(2) parser generator you can parse basic. LRSTAR parser generator can create LR(2) parsers.
Upvotes: 1
Reputation: 370425
By default Lark does not warn you about shift-reduce conflicts in the grammar and instead silently resolves them in favor of shifting. Often this leads to a parser that does not parse what you want it to - as is the case here. You can make lark warn you about conflicts like these by passing the debug = True
flag to Lark()
. That way you'll see that something's wrong even before finding the problem through tests and you might even get helpful information as to where the problem lies.
With the debug
option enabled, you'll get a warning that's there's a shift-reduce conflict where an END_KW
could either mean that the current block_body
is over or it could be an end_stmt
. This is a problem because a LALR(1) parser can only look one token ahead, but we'd have to look a second token ahead to see whether there's an if
after the end
to correctly decide which option to take.
You can fix this, somewhat hackishly, by turning end if
into a single token like this:
ENDIF_KW: /end[ \t\f]+if/i
And then using ENDIF_KW
instead of END_KW IF_KW
.
PS: Note that your grammar will work fine without these changes if you use Earley parsing instead of LALR(1).
Upvotes: 3