user766308
user766308

Reputation: 123

Lark matching custom delimiter multiline strings

I am trying to use lark to extract some information from perl files. For that, I need a basic understanding of what a statement is. The issue I came across are "Here Document" strings. I would describe them as multiline strings with custom delimiters, like:

$my_var .= << 'anydelim';
some things
other things
anydelim

While writing down this question, I figured out a solution using a regex with backreferences / named references. Since I could not find any similar question, I decided to post the question and answer it myself.

If anyone knows any other method (like a way to use back references across multiple lark rules), please let me know!

Upvotes: 1

Views: 657

Answers (2)

MegaIng
MegaIng

Reputation: 7886

If you want to do complex back references across multiple Terminals, e.g. you can't use a single regex, you need to use a PostLexer (or worst case, a Custom lexer). A Small example with a XML like structure:

<html>
    <body>
        Hello World
    </body>
</html>

Could be parsed (an validated) by this grammar + Postlexer:

from typing import Iterator

from lark import Lark, Token

TEXT = r"""
<html>
    <body>
        Hello World
    </body>
</html>
"""

GRAMMAR = r"""
start: node

node: OPEN_TAG content* CLOSE_TAG
content: node
       | TEXT 

TEXT: /[^\s<>]+/
RAW_OPEN: "<" /\w+/ ">"
RAW_CLOSE: "</" /\w+/ ">"

%ignore WS

%import common.WS

%declare OPEN_TAG CLOSE_TAG
"""


class MatchTag:
    always_accept = "RAW_OPEN", "RAW_CLOSE"

    def process(self, stream: Iterator[Token]) -> Iterator[Token]:
        stack = []
        for t in stream:
            if t.type == "RAW_OPEN":
                stack.append(t)
                t.type = "OPEN_TAG"
            elif t.type == "RAW_CLOSE":
                open_tag = stack.pop()
                if open_tag.value[1:-1] != t.value[2:-1]:
                    raise ValueError(f"Non matching closing tag (expected {open_tag.value!r}, got {t.value!r})")
                t.type = "CLOSE_TAG"
            yield t


parser = Lark(GRAMMAR, parser='lalr', postlex=MatchTag())

print(parser.parse(TEXT).pretty())

(Note: Don't use Lark if you actually want to parse XML. There are a lot of pitfalls that are hard to impossible to deal with)

Upvotes: 1

user766308
user766308

Reputation: 123

A solution using a regexp. Key ingredients:

  • back references, in this case named references
  • the /s modifier (causes . to also match newlines
  • .*? to match non greedy (otherwise it would also consume the delimiter)
from lark import Lark

block_grammar = r"""
    %import common.WS
    %ignore WS
    delimited_string: "<<" /(?P<quote>['"])(?P<delimiter>[A-Za-z_]+)(?P=quote)\;.*?(?P=delimiter)/s
"""
minimal_parser = Lark(block_grammar, start="delimited_string")

ast = minimal_parser.parse(r"""
    << 'SomeDelim'; fasdfasdf 
    fddfsdg SomeDelim
""")
print(ast.pretty())

Upvotes: 1

Related Questions