Jon Firuz

Reputation: 115

ANTLR lexer rule consumes too much

ANTLR Lexer Rule Design

I have a requirement for the following token:

The ANTLR lexer rule "AlphaNumericSpaceHyphen" in the grammar below almost works except for one case. Using the parser rule "sic" to test, the following input will parse (without quotes):

"STANDARD INDUSTRIAL CLASSIFICATION: WATER TRANSPORTATION[4400]"

The following input fails to parse (without quotes):

"STANDARD INDUSTRIAL CLASSIFICATION: WATER TRANSPORTATION [4400]"

The issue is that the lexer rule "AlphaNumericSpaceHyphen" consumes the space and the left square bracket after "WATER TRANSPORTATION" before the lexer realizes that there is no match because it went too far.

I have experimented with various types of predicates and lookaheads without any luck. Any help is greatly appreciated.

grammar T;

sic: SICSpecifier AlphaNumericSpaceHyphen  LEFTBRACKET Digits RIGHTBRACKET;

LEFTBRACKET  
:   '[';  

RIGHTBRACKET 
:   ']';

SICSpecifier: 'STANDARD INDUSTRIAL CLASSIFICATION:';

WS : (' '|'\t')+ 
{   
  $channel = HIDDEN;  
};  

fragment UCASEALPHA : 'A'..'Z';
fragment LCASEALPHA : 'a'..'z';
fragment DIGIT : '0'..'9';
Digits: DIGIT+;

AlphaNumericSpaceHyphen 
:           (UCASEALPHA|LCASEALPHA |DIGIT|'-')+  (' ' (UCASEALPHA|LCASEALPHA |DIGIT|'-')+)+   
        |   (UCASEALPHA|LCASEALPHA |DIGIT)+ ('-')+  ((' '|UCASEALPHA|LCASEALPHA |DIGIT|'-')* (UCASEALPHA|LCASEALPHA |DIGIT|'-'))?
        |   ('-')+ (UCASEALPHA|LCASEALPHA |DIGIT)+  ((UCASEALPHA|LCASEALPHA |DIGIT|'-'|' ')* (UCASEALPHA|LCASEALPHA |DIGIT|'-'))?   
        ;

Upvotes: 1

Views: 417

Answers (1)

Zakaria Jaiathe

Reputation: 70

Unfortunately, ANTLR lexer rules do not backtrack. Take a look at:

ANTLR lexer rule consumes characters even if not matched?

You can try adapting your grammar so that the rule changes the type of the token, as suggested in that solution.
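A related way to work around the missing backtracking, shown here as a rough, untested sketch for ANTLR 3 (which the `$channel = HIDDEN` syntax in the question suggests), is to guard the space with a syntactic predicate so that it is only consumed when another word character follows. Note that this is a predicate-based guard, not the token-type change described in the linked solution. The fragment name `WordChar` is made up for this example, and the second and third alternatives are copied from the question unchanged, so they may need a similar guard.

```
// Hypothetical helper fragment (not part of the original grammar).
fragment WordChar : UCASEALPHA | LCASEALPHA | DIGIT | '-';

AlphaNumericSpaceHyphen
    :   // Consume a space only if a word character follows it;
        // otherwise stop here and leave the space to the WS rule.
        WordChar+ ( (' ' WordChar)=> ' ' WordChar+ )+
    |   (UCASEALPHA|LCASEALPHA|DIGIT)+ ('-')+ ((' '|UCASEALPHA|LCASEALPHA|DIGIT|'-')* (UCASEALPHA|LCASEALPHA|DIGIT|'-'))?
    |   ('-')+ (UCASEALPHA|LCASEALPHA|DIGIT)+ ((UCASEALPHA|LCASEALPHA|DIGIT|'-'|' ')* (UCASEALPHA|LCASEALPHA|DIGIT|'-'))?
    ;
```

The intent is that for "WATER TRANSPORTATION [4400]" the rule stops after "TRANSPORTATION", so the trailing space goes to WS and the bracket to LEFTBRACKET.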

Hope this helps.

Upvotes: 0
