Hi everyone. I'm parsing shell output (mocked here) and I'm running into an error where I really don't expect one. A minimal, reproducible, working example is below:
from rich import print as rprint
import typing as tp
from lark import Lark, Transformer, Tree
from lark.indenter import Indenter
class TreeIndenter(Indenter):
    NL_type = '_NL'
    OPEN_PAREN_types: tp.List = []
    CLOSE_PAREN_types: tp.List = []
    INDENT_type = '_INDENT'
    DEDENT_type = '_DEDENT'
    tab_len = 8

    @property
    def always_accept(self):
        return (self.NL_type,)

kwargs = {
    "parser": "lalr",
    "postlex": TreeIndenter(),
    "maybe_placeholders": False,
}
text = """
=======================================================================
SKIT SEASON EPISODE CAST NUMBER
=======================================================================
skit_name=vikings 3 10 3
skit_name=parrot 2 5 2
skit_name=eel 1 7 2
"""
grammar = r"""
start: [_NL] header data
header: line_break column_names line_break
data: data_line+
data_line: me_info (STRING2 | STRING)* _NL
me_info: "skit_name="STRING
line_break: "="* _NL
column_names: (STRING | STRING2)* _NL
STRING2 : STRING " " STRING
STRING : ESCAPED_STRING | VALUE
VALUE : ("_" | LETTER | DIGIT | "-" | "[]" | "/" | "." | ":")+
%import common.ESCAPED_STRING
%import common.LETTER
%import common.DIGIT
%ignore / /
_NL: /(\r?\n[\t ]*)+/
"""
parser = Lark(grammar=grammar)
rprint(parser.parse(text))
This outputs the correct tree. Do note that kwargs isn't being used.
However, since I need to combine it with a parser for indented output, I need to use an Indenter and the listed kwargs. When I include them, I get the following error (full trace omitted):
UnexpectedToken: Unexpected token Token('STRING', 'skit_name') at line 5, column 1.
Expected one of:
* __ANON_0
This means that the first line of data causes the problem, but it's not obvious what is actually expected.
Interestingly, if the first line break is omitted (from both the text and the grammar), it parses successfully.
Additionally, the error occurs when either parser or postlex is included, and it's the same error regardless of which of them is in kwargs.
EDIT: I was hoping to come up with a workaround for the indentation and avoid the parser and postlex keywords, but it seems that specifying the lalr parser is required to use the Transformer, so I need it anyway and can't just sidestep the problem.
Upvotes: 0
Views: 558
Reputation: 7886
Providing a PostLexer changes the default parser/lexer combo from earley/dynamic to earley/basic, since the dynamic lexer can't handle the postlexer. However, the basic lexer is far less powerful and can't handle this kind of ambiguity. It sees that at that point a STRING would fit and then just uses that.
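To make the point about context concrete, here is a toy model written with only the standard library. It is not Lark's implementation: the terminal names, their order, and the `next_token` helper are all assumptions made for illustration. It only shows how a lexer that knows nothing about the parser's state can settle on STRING for `skit_name`, while a lexer that is told which terminals are acceptable picks the prefix.

```python
import re

# Hypothetical, simplified terminals. The names and their order are
# assumptions for this sketch, not what Lark generates internally.
TERMINALS = [
    ("STRING", re.compile(r"[A-Za-z0-9_\-/.:]+")),
    ("SKIT_PREFIX", re.compile(r"skit_name=")),
]


def next_token(text, pos, accepted=None):
    """Return (name, value) for the first terminal that matches at pos.

    accepted=None models a lexer with no parser context.
    A set of names models a lexer that is told what the parser expects.
    """
    for name, pattern in TERMINALS:
        if accepted is not None and name not in accepted:
            continue
        match = pattern.match(text, pos)
        if match:
            return name, match.group()
    raise ValueError(f"no terminal matches at position {pos}")


line = "skit_name=vikings"

# No context: STRING is tried first and happily matches "skit_name".
no_context = next_token(line, 0)

# With context: only the prefix terminal is allowed at the start of a data line.
with_context = next_token(line, 0, accepted={"SKIT_PREFIX"})

print(no_context)
print(with_context)
```

In this sketch the context-free call returns a STRING token, which mirrors the `Token('STRING', 'skit_name')` in the error message above.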
There are a few possible solutions:
Use lalr/contextual. This might require changes in the grammar, since lalr can't handle ambiguity like earley can.
Write the prefix as "skit_name" "=". This will make the lexers correctly differentiate between "skit_name" and "NAME", as long as you have the contextual lexer. Yes, technically it means something else, since you currently can't have whitespace there. If you really need that, you can use /\B=/ instead of /=/.
Stay with earley. (You can always just apply the Transformer after the fact instead of inline.)