Reputation: 23995
Is there a way to use an ANTLR parser as a searcher, i.e. to find the first instance of a substring ss
of a longer string S
that matches a given rule my_rule
?
Conceptually, I could accomplish this by looking for a match at position S[i]
, incrementing i
until I successfully retrieve a match or S
is exhausted.
However, in practice this doesn't work very well, because prefixes in S
might coincidentally have characters that match tokens in my grammar. Depending on how this happens, a valid string ss
in S
might get recognized several times, or skipped over erratically, or there might be lots of errors printed about "token recognition error".
Is there an approach I haven't thought of, or an ANTLR feature I don't know about?
I'm using the Python bindings for ANTLR, if that matters.
EXAMPLE:
Given the following grammar:
grammar test ;
options { language=Python3; }
month returns [val]
: JAN {$val = 1}
| FEB {$val = 2}
| MAR {$val = 3}
| APR {$val = 4}
| MAY {$val = 5}
;
day_number returns [val]
: a=INT {$val = int($a.text)} ;
day returns [val]
: day_number WS? {$val = int($day_number.start.text)}
;
month_and_day returns [val]
: month WS day {$val = ($month.val, $day.val)}
| day WS ('of' WS)? month {$val = ($month.val, $day.val)}
;
WS : [ \n\t]+ ; // whitespace is not ignored
JAN : 'jan' ('.' | 'uary')? ;
FEB : 'feb' ('.' | 'ruary')? ;
MAR : 'mar' ('.' | 'ch')? ;
APR : 'apr' ('.' | 'il')? ;
MAY : 'may' ;
INT
: [1-9]
| '0' [1-9]
| '1' [0-9]
| '2' [0-3]
;
and the following script to test it:
import sys
sys.path.append('gen')
from testParser import testParser
from testLexer import testLexer
from antlr4 import InputStream
from antlr4 import CommonTokenStream, TokenStream
def parse(text: str):
date_input = InputStream(text.lower())
lexer = testLexer(date_input)
stream = CommonTokenStream(lexer)
parser = testParser(stream)
return parser.month_and_day()
for t in ['Jan 6',
'hello Jan 6, 1984',
'hello maybe Jan 6, 1984']:
value = parse(t)
print(value.val)
I get the following results:
# First input - good
(1, 6)
# Second input - errors printed to STDERR
line 1:0 token recognition error at: 'h'
line 1:1 token recognition error at: 'e'
line 1:2 token recognition error at: 'l'
line 1:3 token recognition error at: 'l'
line 1:4 token recognition error at: 'o '
line 1:11 token recognition error at: ','
(1, 6)
# Third input - prints errors and throws exception
line 1:0 token recognition error at: 'h'
line 1:1 token recognition error at: 'e'
line 1:2 token recognition error at: 'l'
line 1:3 token recognition error at: 'l'
line 1:4 token recognition error at: 'o '
line 1:9 token recognition error at: 'b'
line 1:10 token recognition error at: 'e'
line 1:12 mismatched input 'jan' expecting INT
Traceback (most recent call last):
File "test_grammar.py", line 25, in <module>
value = parse(t)
File "test_grammar.py", line 19, in parse
return parser.month_and_day()
File "gen/testParser.py", line 305, in month_and_day
localctx._day = self.day()
File "gen/testParser.py", line 243, in day
localctx.val = int((None if localctx._day_number is None else localctx._day_number.start).text)
ValueError: invalid literal for int() with base 10: 'jan'
Process finished with exit code 1
To use the incremental approach I outlined above, I'd need a way to suppress the token recognition error
output and also wrap the exception in a try
or similar. Feels like I'd be very much going against the grain, and it would be difficult to distinguish these parsing exceptions from other things going wrong.
(META - I could swear I already asked this question somewhere about 4 months ago, but I couldn't find anything on SO, or the ANTLR GitHub tracker, or the ANTLR Google Group.)
Upvotes: 2
Views: 1824
Reputation: 23995
The solution I've settled on, based on a variant of an idea from @grosenberg's answer, is as follows.
1) Add a fallback lexer rule to match any text that isn't already matched by existing rules. Do not ignore/skip these tokens.
OTHER : . ;
2) Add a parser alternative to match either the rule of interest, or (with lower precedence) anything else:
month_and_day_or_null returns [val]
: month_and_day {$val = $month_and_day.val}
| . {$val = None}
;
3) In the application code, look for either a None
or a populated value:
def parse(text: str):
date_input = InputStream(text.lower())
lexer = testLexer(date_input)
stream = CommonTokenStream(lexer)
parser = testParser(stream)
return parser.month_and_day_or_null()
for t in ['Jan 6',
'hello Jan 6, 1984',
'hello maybe Jan 6, 1984']:
for i in range(len(t)):
value = parse(t[i:])
if value.val:
print(f"Position {i}: {value.val}")
break
This has the desired effect at match time:
Position 0: (1, 6)
Position 6: (1, 6)
Position 12: (1, 6)
Upvotes: 0
Reputation: 6001
Is there a way to use an ANTLR parser as a searcher, i.e. to find the first instance of a substring
ss
of a longer stringS
that matches a given rulemy_rule
?
The short answer is no. ANTLR does not work as a substitute/equivalent to any of the standard regex-based tools, like sed
and awk
.
The longer answer is yes, but with messy caveats. ANTLR expects to parse a structured, largely unambiguous input text. Text that is of no semantic significance can be ignored by adding the lexer rule (at lowest priority/bottom position)
IGNORE : . -> skip;
That way, anything not explicitly recognized in the lexer is ignored.
The next problem is the potential semantic overlap between 'normal' text and keywords, e.g. Jan (name) - Jan (month abrev). In general, this can be handled by adding a BaseErrorListener
to the parser to distinguish between real and meaningless errors. What constitutes real vs meaningless can involve complex corner cases depending on the application.
Finally, the rule
day_number returns [val]
: a=INT {$val = int($a.text)} ;
is returning an int
value not an INT
token, hence the error that is being reported. The rule should be
day_number : INT ;
Upvotes: 2