thevoiddancer

Reputation: 393

Parser or postlex causing an error in Lark

Hi everyone. I'm parsing shell output (mocked here) and I'm running into an error where I really don't expect one. A minimal, reproducible, working example is below:

from rich import print as rprint
import typing as tp

from lark import Lark, Transformer, Tree
from lark.indenter import Indenter

class TreeIndenter(Indenter):
    NL_type = '_NL'
    OPEN_PAREN_types: tp.List = []
    CLOSE_PAREN_types: tp.List = []
    INDENT_type = '_INDENT'
    DEDENT_type = '_DEDENT'
    tab_len = 8

    @property
    def always_accept(self):
        return (self.NL_type,)

kwargs = {
    "parser": "lalr",
    "postlex": TreeIndenter(),
    "maybe_placeholders": False,
}


text = """
=======================================================================
SKIT                       SEASON          EPISODE         CAST NUMBER
=======================================================================
skit_name=vikings          3               10              3
skit_name=parrot           2               5               2
skit_name=eel              1               7               2
"""

grammar = r"""
start: [_NL] header data
header: line_break column_names line_break 
data: data_line+
data_line: me_info (STRING2 | STRING)* _NL
me_info: "skit_name="STRING

line_break: "="* _NL
column_names: (STRING | STRING2)* _NL


STRING2         : STRING " " STRING
STRING          : ESCAPED_STRING | VALUE
VALUE           : ("_" | LETTER | DIGIT | "-" | "[]" | "/" | "." | ":")+

%import common.ESCAPED_STRING
%import common.LETTER
%import common.DIGIT
%ignore / /

_NL: /(\r?\n[\t ]*)+/
"""

parser = Lark(grammar=grammar)
rprint(parser.parse(text))

This outputs the correct tree. Note that kwargs is defined but not yet passed to Lark.

However, I need to combine this with a parser for indented output, so I need an Indenter and the kwargs listed above. When I pass them in, I get the following error (full trace omitted):

UnexpectedToken: Unexpected token Token('STRING', 'skit_name') at line 5, column 1.
Expected one of: 
    * __ANON_0

This means the first line of data is causing the problem, but it's not obvious what is actually expected.

Interestingly, if the first line break is omitted (from both the text and the grammar), it parses successfully.

Additionally, the error occurs when either parser or postlex is included, and it's the same error no matter which of them is in kwargs.

EDIT: I was hoping to come up with a workaround for the indentation and avoid the parser and postlex keywords, but it seems that specifying the lalr parser is required to use the Transformer, so I need it anyway and can't just sidestep the problem.

Upvotes: 0

Views: 558

Answers (1)

MegaIng

Reputation: 7886

Providing a PostLexer changes the default parser/lexer combo from earley/dynamic to earley/basic, since the dynamic lexer can't handle the postlexer. However, the basic lexer is far less powerful and can't handle this kind of ambiguity. It sees that a STRING would fit at that point and just uses that.
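To make the difference concrete, here is a toy illustration in plain Python. It is not Lark's implementation: the terminal table, its ordering, and the names `TERMINALS`, `SKIT_PREFIX` and `next_token` are all assumptions picked to reproduce the symptom. A lexer that knows nothing about the parser state takes the first terminal that matches, while one that is told which terminals are acceptable skips the rest.

```python
import re

# Hypothetical terminal table. The order is an assumption chosen so that
# the generic STRING terminal is tried before the literal prefix.
TERMINALS = [
    ("STRING", re.compile(r"[A-Za-z0-9_\-./:]+")),
    ("SKIT_PREFIX", re.compile(r"skit_name=")),
]


def next_token(text, pos, accepted=None):
    """Return (name, value) for the first terminal matching at `pos`.

    If `accepted` is given, only those terminal names are tried, which
    mimics a lexer that is restricted by the current parser state.
    """
    for name, pattern in TERMINALS:
        if accepted is not None and name not in accepted:
            continue
        match = pattern.match(text, pos)
        if match:
            return name, match.group()
    raise ValueError(f"no terminal matches at position {pos}")


line = "skit_name=vikings"

# No parser context: STRING matches first and swallows "skit_name".
basic_result = next_token(line, 0)

# With context: only the prefix is acceptable here, so it wins.
contextual_result = next_token(line, 0, accepted={"SKIT_PREFIX"})

print(basic_result)       # ('STRING', 'skit_name')
print(contextual_result)  # ('SKIT_PREFIX', 'skit_name=')
```

The first result mirrors the reported error: a STRING token containing 'skit_name' arrives where the parser wanted the anonymous "skit_name=" terminal.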

There are a few possible solutions:

  • Explicitly use the parser/lexer combo lalr/contextual. This might require changes in the grammar, since lalr can't handle ambiguity like Earley can.
    • One thing that you can fix is to use "skit_name" "=". This will make the lexers correctly differentiate between "skit_name" and "NAME", as long as you have the contextual lexer. Yes, technically it means something else since you currently can't have whitespace there. If you really need that, you can use /\B=/ instead of /=/
  • Somehow avoid the need for a postlexer and continue to use earley. (You can always just apply the Transformer after the fact instead of inline)

Upvotes: 1
