Ken Williams
Ken Williams

Reputation: 23995

ANTLR - find first match for grammar within string

Is there a way to use an ANTLR parser as a searcher, i.e. to find the first instance of a substring ss of a longer string S that matches a given rule my_rule?

Conceptually, I could accomplish this by looking for a match at position S[i], incrementing i until I successfully retrieve a match or S is exhausted.

However, in practice this doesn't work very well, because prefixes in S might coincidentally have characters that match tokens in my grammar. Depending on how this happens, a valid string ss in S might get recognized several times, or skipped over erratically, or there might be lots of errors printed about "token recognition error".

Is there an approach I haven't thought of, or an ANTLR feature I don't know about?

I'm using the Python bindings for ANTLR, if that matters.

EXAMPLE:

Given the following grammar:

grammar test ;

options { language=Python3; }


month returns [val]
  : JAN {$val = 1}
  | FEB {$val = 2}
  | MAR {$val = 3}
  | APR {$val = 4}
  | MAY {$val = 5}
  ;

day_number returns [val]
  : a=INT {$val = int($a.text)} ;

day returns [val]
  : day_number WS?     {$val = int($day_number.start.text)}
  ;

month_and_day returns [val]
  : month WS day             {$val = ($month.val, $day.val)}
  | day WS ('of' WS)? month  {$val = ($month.val, $day.val)}
  ;


WS : [ \n\t]+ ;  // whitespace is not ignored

JAN : 'jan' ('.' | 'uary')? ;
FEB : 'feb' ('.' | 'ruary')? ;
MAR : 'mar' ('.' | 'ch')? ;
APR : 'apr' ('.' | 'il')? ;
MAY : 'may' ;


INT
  : [1-9]
  | '0' [1-9]
  | '1' [0-9]
  | '2' [0-3]
  ;

and the following script to test it:

import sys

sys.path.append('gen')


from testParser import testParser
from testLexer import testLexer


from antlr4 import InputStream
from antlr4 import CommonTokenStream, TokenStream


def parse(text: str):
    date_input = InputStream(text.lower())
    lexer = testLexer(date_input)
    stream = CommonTokenStream(lexer)
    parser = testParser(stream)
    return parser.month_and_day()


for t in ['Jan 6',
          'hello Jan 6, 1984',
          'hello maybe Jan 6, 1984']:
    value = parse(t)
    print(value.val)

I get the following results:

# First input - good
(1, 6)

# Second input - errors printed to STDERR
line 1:0 token recognition error at: 'h'
line 1:1 token recognition error at: 'e'
line 1:2 token recognition error at: 'l'
line 1:3 token recognition error at: 'l'
line 1:4 token recognition error at: 'o '
line 1:11 token recognition error at: ','
(1, 6)

# Third input - prints errors and throws exception
line 1:0 token recognition error at: 'h'
line 1:1 token recognition error at: 'e'
line 1:2 token recognition error at: 'l'
line 1:3 token recognition error at: 'l'
line 1:4 token recognition error at: 'o '
line 1:9 token recognition error at: 'b'
line 1:10 token recognition error at: 'e'
line 1:12 mismatched input 'jan' expecting INT
Traceback (most recent call last):
  File "test_grammar.py", line 25, in <module>
    value = parse(t)
  File "test_grammar.py", line 19, in parse
    return parser.month_and_day()
  File "gen/testParser.py", line 305, in month_and_day
    localctx._day = self.day()
  File "gen/testParser.py", line 243, in day
    localctx.val = int((None if localctx._day_number is None else localctx._day_number.start).text)
ValueError: invalid literal for int() with base 10: 'jan'
Process finished with exit code 1

To use the incremental approach I outlined above, I'd need a way to suppress the token recognition error output and also wrap the exception in a try or similar. Feels like I'd be very much going against the grain, and it would be difficult to distinguish these parsing exceptions from other things going wrong.

(META - I could swear I already asked this question somewhere about 4 months ago, but I couldn't find anything on SO, or the ANTLR GitHub tracker, or the ANTLR Google Group.)

Upvotes: 2

Views: 1824

Answers (2)

Ken Williams
Ken Williams

Reputation: 23995

The solution I've settled on, based on a variant of an idea from @grosenberg's answer, is as follows.

1) Add a fallback lexer rule to match any text that isn't already matched by existing rules. Do not ignore/skip these tokens.

OTHER : . ;

2) Add a parser alternative to match either the rule of interest, or (with lower precedence) anything else:

month_and_day_or_null returns [val]
  : month_and_day  {$val = $month_and_day.val}
  | .              {$val = None}
  ;

3) In the application code, look for either a None or a populated value:

def parse(text: str):
    date_input = InputStream(text.lower())
    lexer = testLexer(date_input)
    stream = CommonTokenStream(lexer)
    parser = testParser(stream)
    return parser.month_and_day_or_null()

for t in ['Jan 6',
          'hello Jan 6, 1984',
          'hello maybe Jan 6, 1984']:
    for i in range(len(t)):
        value = parse(t[i:])
        if value.val:
            print(f"Position {i}: {value.val}")
            break

This has the desired effect at match time:

Position 0: (1, 6)
Position 6: (1, 6)
Position 12: (1, 6)

Upvotes: 0

GRosenberg
GRosenberg

Reputation: 6001

Is there a way to use an ANTLR parser as a searcher, i.e. to find the first instance of a substring ss of a longer string S that matches a given rule my_rule?

The short answer is no. ANTLR does not work as a substitute/equivalent to any of the standard regex-based tools, like sed and awk.

The longer answer is yes, but with messy caveats. ANTLR expects to parse a structured, largely unambiguous input text. Text that is of no semantic significance can be ignored by adding the lexer rule (at lowest priority/bottom position)

IGNORE : . -> skip; 

That way, anything not explicitly recognized in the lexer is ignored.

The next problem is the potential semantic overlap between 'normal' text and keywords, e.g. Jan (name) - Jan (month abrev). In general, this can be handled by adding a BaseErrorListener to the parser to distinguish between real and meaningless errors. What constitutes real vs meaningless can involve complex corner cases depending on the application.

Finally, the rule

day_number returns [val]
  : a=INT {$val = int($a.text)} ;

is returning an int value not an INT token, hence the error that is being reported. The rule should be

day_number : INT ;

Upvotes: 2

Related Questions