Vallo

Reputation: 1967

ANTLR: how to debug a misidentified token

I am trying to implement an ANTLR4 grammar for a simple template engine. The engine has three kinds of clauses:

IF ANSWERED ( variable )

END IF

Variable

A Variable can contain any uppercase or lowercase letters, as well as whitespace. Both IF ANSWERED and END IF are always uppercase.

I have written the following grammar/lexer rules so far, but my problem is that IF ANSWERED keeps getting recognized as a single VARIABLE token rather than as two tokens, IF and ANSWERED.

grammar program;

/**grammar */
command: (ifStart | ifEnd | VARIABLE ) EOF;

ifStart: IF ANSWERED '(' VARIABLE ')';

ifEnd: 'END IF';

/** lexer */

IF: 'IF';
ANSWERED: 'ANSWERED';

TEXT: (LOWERCASE | UPPERCASE | NUMBER) ;
VARIABLE: (TEXT | [ \t\r\n])+;

fragment LOWERCASE: [a-z];
fragment UPPERCASE: [A-Z];
fragment NUMBER: [0-9];

If I try to parse IF ANSWERED ( FirstName ) I get the following output:

[@0,0:10='IF ANSWERED',<VARIABLE>,1:0]
[@1,11:11='(',<'('>,1:11]
[@2,12:25='Execution date',<VARIABLE>,1:12]
[@3,26:26=')',<')'>,1:26]
[@4,27:26='<EOF>',<EOF>,1:27]
line 1:0 mismatched input 'IF ANSWERED' expecting 'IF'

I read that ANTLR4 is greedy and tries to match the longest possible token, but I don't understand what the correct approach is, or how to reason about the problem to find a solution.

Upvotes: 1

Views: 114

Answers (1)

Bart Kiers

Reputation: 170158

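Correct: ANTLR's lexer is greedy and tries to consume as much input as possible for each token. Your VARIABLE rule also matches spaces, so IF ANSWERED is tokenised as a single VARIABLE token instead of the 2 separate keywords. When two rules match the same number of characters, the rule defined first wins, so keywords defined before a general text rule take precedence once the text rule can no longer run across spaces. You'll need to make sure the text rule does not match spaces.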
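To see the longest-match behaviour in isolation, here is a minimal Python sketch that imitates it with regular expressions. It is not ANTLR and not generated code: the `RULES` table is a hand translation of the lexer rules from the question (the 'END IF' literal is left out for brevity), and `tokenize` is a hypothetical helper written only for this illustration.

```python
import re

# Lexer rules from the question, in grammar order, hand-translated to regexes.
# Assumption: the implicit '(' and ')' literals are listed here explicitly.
RULES = [
    ("IF", r"IF"),
    ("ANSWERED", r"ANSWERED"),
    ("'('", r"\("),
    ("')'", r"\)"),
    ("TEXT", r"[a-zA-Z0-9]"),
    ("VARIABLE", r"[a-zA-Z0-9 \t\r\n]+"),
]


def tokenize(text, rules):
    """Split text into (rule name, matched text) pairs using longest match."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            match = re.compile(pattern).match(text, pos)
            # A strictly longer match replaces the current best;
            # on a tie the earlier rule is kept.
            if match and match.end() - pos > best_len:
                best_name, best_len = name, match.end() - pos
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


for name, value in tokenize("IF ANSWERED(FirstName)", RULES):
    print(name, repr(value))
print(tokenize("IF", RULES))
```

With these rules, `IF ANSWERED` comes out as one VARIABLE token because VARIABLE matches 11 characters while IF matches only 2. The input `IF` on its own is matched by both IF and VARIABLE at the same length, and IF wins because it is listed first.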

Something like this could get you started:

parse
 : command* EOF
 ;

command
 : (ifStatement | variable)+
 ;

ifStatement
 : IF ANSWERED '(' variable ')' command* END IF
 ;

variable
 : TEXT
 ;

IF       : 'IF';
END      : 'END';
ANSWERED : 'ANSWERED';
TEXT     : [a-zA-Z0-9]+;
SPACES   : [ \t\r\n]+ -> skip;
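As a rough check of how these lexer rules split the input, the same longest-match idea can be simulated in Python. This is only a sketch under stated assumptions: `RULES` is a hand translation of the lexer rules above (with the implicit '(' and ')' literals added), the third field stands in for `-> skip`, and `tokenize` is a hypothetical helper, not part of ANTLR.

```python
import re

# (rule name, regex, skip?) in grammar order; hand-translated from the rules above.
RULES = [
    ("IF", r"IF", False),
    ("END", r"END", False),
    ("ANSWERED", r"ANSWERED", False),
    ("'('", r"\(", False),
    ("')'", r"\)", False),
    ("TEXT", r"[a-zA-Z0-9]+", False),
    ("SPACES", r"[ \t\r\n]+", True),  # stands in for "-> skip"
]


def tokenize(text, rules):
    """Longest match wins; on a tie the earlier rule wins; skipped rules emit nothing."""
    tokens = []
    pos = 0
    while pos < len(text):
        best = None  # (name, length, skip)
        for name, pattern, skip in rules:
            match = re.compile(pattern).match(text, pos)
            if match and (best is None or match.end() - pos > best[1]):
                best = (name, match.end() - pos, skip)
        if best is None:
            raise ValueError(f"no rule matches at offset {pos}")
        name, length, skip = best
        if not skip:
            tokens.append((name, text[pos:pos + length]))
        pos += length
    return tokens


print(tokenize("IF ANSWERED ( FirstName ) END IF", RULES))
print(tokenize("IFFY", RULES))
print(tokenize("Execution date", RULES))
```

Because TEXT stops at whitespace, IF, ANSWERED and END each tie with TEXT on length and win by being defined first, while a longer word such as IFFY still becomes TEXT. A name containing a space, such as Execution date, arrives as two TEXT tokens, so if such names are needed, a parser rule like variable : TEXT+ would be one way to accept them.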

Upvotes: 2
