Reputation: 123
I am trying to use Lark to extract some information from Perl files. For that, I need a basic understanding of what a statement is. The issue I ran into is "Here Document" strings. I would describe them as multiline strings with custom delimiters, like:
$my_var .= << 'anydelim';
some things
other things
anydelim
While writing down this question, I figured out a solution using a regex with backreferences / named references. Since I could not find any similar question, I decided to post the question and answer it myself.
If anyone knows any other method (like a way to use back references across multiple lark rules), please let me know!
Upvotes: 1
Views: 657
Reputation: 7886
If you want to do complex back references across multiple terminals, i.e. you can't use a single regex, you need to use a PostLexer (or, worst case, a custom lexer). A small example with an XML-like structure:
<html>
<body>
Hello World
</body>
</html>
This could be parsed (and validated) by this grammar + PostLexer:
from typing import Iterator
from lark import Lark, Token
TEXT = r"""
<html>
<body>
Hello World
</body>
</html>
"""
GRAMMAR = r"""
start: node
node: OPEN_TAG content* CLOSE_TAG
content: node
| TEXT
TEXT: /[^\s<>]+/
RAW_OPEN: "<" /\w+/ ">"
RAW_CLOSE: "</" /\w+/ ">"
%ignore WS
%import common.WS
%declare OPEN_TAG CLOSE_TAG
"""
class MatchTag:
    always_accept = "RAW_OPEN", "RAW_CLOSE"

    def process(self, stream: Iterator[Token]) -> Iterator[Token]:
        stack = []
        for t in stream:
            if t.type == "RAW_OPEN":
                stack.append(t)
                t.type = "OPEN_TAG"
            elif t.type == "RAW_CLOSE":
                open_tag = stack.pop()
                if open_tag.value[1:-1] != t.value[2:-1]:
                    raise ValueError(f"Non matching closing tag (expected {open_tag.value!r}, got {t.value!r})")
                t.type = "CLOSE_TAG"
            yield t

parser = Lark(GRAMMAR, parser='lalr', postlex=MatchTag())
print(parser.parse(TEXT).pretty())
(Note: Don't use Lark if you actually want to parse XML. There are a lot of pitfalls that are hard or impossible to deal with.)
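The stack check inside process can also be tried in isolation, without Lark installed. The following is only a sketch: Tok is a hypothetical stand-in for lark.Token, match_tags is a made-up name, and the token lists are written by hand instead of coming from a lexer. Unlike the class above, it also reports a closing tag that has no opener at all instead of failing on an empty stack.

```python
from typing import Iterable, Iterator, NamedTuple


class Tok(NamedTuple):
    # Hypothetical stand-in for lark.Token: a type name and the matched text.
    type: str
    value: str


def match_tags(stream: Iterable[Tok]) -> Iterator[Tok]:
    stack = []  # names of the tags that are currently open
    for t in stream:
        if t.type == "RAW_OPEN":
            stack.append(t.value[1:-1])  # "<body>" -> "body"
            t = Tok("OPEN_TAG", t.value)
        elif t.type == "RAW_CLOSE":
            name = t.value[2:-1]  # "</body>" -> "body"
            # Assumption of this sketch: a stray closing tag is reported too.
            if not stack or stack.pop() != name:
                raise ValueError(f"Non matching closing tag: {t.value!r}")
            t = Tok("CLOSE_TAG", t.value)
        yield t


good = [Tok("RAW_OPEN", "<html>"), Tok("TEXT", "Hello"), Tok("RAW_CLOSE", "</html>")]
print([t.type for t in match_tags(good)])

bad = [Tok("RAW_OPEN", "<html>"), Tok("RAW_CLOSE", "</body>")]
try:
    list(match_tags(bad))
except ValueError as exc:
    print(exc)
```

The point of the design is that the lexer only produces the raw tokens, and the renaming to OPEN_TAG / CLOSE_TAG happens after the names have been compared, so the parser never sees a mismatched pair.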
Upvotes: 1
Reputation: 123
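A solution using a single regex. Key ingredients: named groups with backreferences ((?P<quote>...) closed by (?P=quote), and likewise for the delimiter), a non-greedy .*? for the body, and the s flag so that . also matches newlines: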
from lark import Lark
block_grammar = r"""
%import common.WS
%ignore WS
delimited_string: "<<" /(?P<quote>['"])(?P<delimiter>[A-Za-z_]+)(?P=quote)\;.*?(?P=delimiter)/s
"""
minimal_parser = Lark(block_grammar, start="delimited_string")
ast = minimal_parser.parse(r"""
<< 'SomeDelim'; fasdfasdf
fddfsdg SomeDelim
""")
print(ast.pretty())
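The regex can also be tried on its own with Python's re module, which makes it easier to see what the backreferences do. This sketch is not part of the original answer: it folds the "<<" literal and optional whitespace into the pattern, adds a body group for convenience, and the names HEREDOC, STRICT and the sample strings are made up for illustration. STRICT is an assumed stricter variant that requires the closing delimiter to stand alone on its own line.

```python
import re

# The answer's pattern, with "<<" and optional whitespace folded in and a
# named "body" group added (both are additions for this sketch).
HEREDOC = re.compile(
    r"""<<\s*(?P<quote>['"])(?P<delimiter>[A-Za-z_]+)(?P=quote);(?P<body>.*?)(?P=delimiter)""",
    re.DOTALL,  # same effect as the trailing "s" flag in the grammar
)

# Assumed stricter variant: the body starts on the next line and the closing
# delimiter must stand alone on its own line.
STRICT = re.compile(
    r"""<<\s*(?P<quote>['"])(?P<delimiter>[A-Za-z_]+)(?P=quote);[^\n]*\n(?P<body>.*?)^(?P=delimiter)$""",
    re.DOTALL | re.MULTILINE,
)

perl = "$my_var .= << 'anydelim';\nsome things\nother things\nanydelim\n"
m = HEREDOC.search(perl)
print(m.group("delimiter"), repr(m.group("body")))

# Mismatched quotes are rejected by the (?P=quote) backreference.
print(HEREDOC.search("<< 'abc\";\nx\nabc\n"))

# The lazy match stops at the first occurrence of the delimiter, even mid-line.
tricky = "<< 'END';\nthe END is near\nEND\n"
print(repr(HEREDOC.search(tricky).group("body")))
print(repr(STRICT.search(tricky).group("body")))
```

The last two lines show a limitation worth keeping in mind: with the plain pattern, a body that happens to contain the delimiter word is cut short, while the line-anchored variant keeps reading until the delimiter appears alone on a line.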
Upvotes: 1