Lark matching custom delimiter multiline strings

Question

I am trying to use Lark to extract some information from Perl files. For that, I need a basic understanding of what a statement is. The issue I came across is "here document" strings, which I would describe as multiline strings with custom delimiters, like:

$my_var .= << 'anydelim';
some things
other things
anydelim

While writing down this question, I figured out a solution using a regex with backreferences / named references. Since I could not find any similar question, I decided to post the question and answer it myself.

If anyone knows any other method (like a way to use back references across multiple lark rules), please let me know!

user766308 · Accepted Answer

A solution using a regexp. Key ingredients:

back references, in this case named references
the /s modifier (causes . to also match newlines)
.*? to match non-greedily (a greedy .* would run on to the last occurrence of the delimiter in the input)
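To see these three ingredients working together outside of Lark, here is a minimal sketch using only Python's re module. The name HEREDOC and the leading <<\s* are assumptions made for this illustration, and re.DOTALL stands in for the /s modifier.

```python
import re

# Hypothetical standalone pattern, mirroring the ingredients listed above:
#   (?P<quote>...) / (?P=quote)          named group and its back reference
#   (?P<delimiter>...) / (?P=delimiter)  the custom delimiter and its repeat
#   re.DOTALL                            lets . match newlines (like /s)
#   .*?                                  non-greedy, stops at the first delimiter
HEREDOC = re.compile(
    r"""<<\s*(?P<quote>['"])(?P<delimiter>[A-Za-z_]+)(?P=quote);.*?(?P=delimiter)""",
    re.DOTALL,
)

text = "$my_var .= << 'anydelim';\nsome things\nother things\nanydelim\n"

m = HEREDOC.search(text)
print(m.group("delimiter"))  # anydelim
print(m.group(0))            # the whole here document, from << to the closing delimiter
```

Note that, like the regex in the grammar, this does not require the closing delimiter to sit on a line of its own, so a body that happens to contain the delimiter text would end the match early.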

from lark import Lark

block_grammar = r"""
    %import common.WS
    %ignore WS
    delimited_string: "<<" /(?P<quote>['"])(?P<delimiter>[A-Za-z_]+)(?P=quote)\;.*?(?P=delimiter)/s
"""
minimal_parser = Lark(block_grammar, start="delimited_string")

ast = minimal_parser.parse(r"""
    << 'SomeDelim'; fasdfasdf 
    fddfsdg SomeDelim
""")
print(ast.pretty())
