ANTLR : issue with greedy rule

Question

I never worked with ANTLR and generative grammars, so this is my first attempt.

I have a custom language I need to parse. Here's an example:

-- This is a comment
CMD.CMD1:foo_bar_123
CMD.CMD2
CMD.CMD4:9 of 28 (full)
CMD.NOTES:
This is an note.
    A line 
      (1) there could be anything here foo_bar_123 & $ £ _ , . ==> BOOM
      (3) same here
CMD.END_NOTES:

Briefly, there could be 4 types of lines:

1) -- comment
2) .
3) .: 
4) .:
       
       
       ...
   .:

is the literal "CMD"

is a single word (uppercase, lowercase letters, numbers, '_')

is the same word of but preceded by the literal "end_"

could be any character

Here's what I've done so far:

grammar MyGrammar;

/*
* Parser Rules
*/

root                : line+ EOF ;

line                : (comment_line | command_line | normal_line) NEWLINE;

comment_line        : COMMENT ;

command_line        : section '.' command ((COLON WHITESPACE*)? arg)? ;

normal_line         : TEXT ;

section             : CMD ;

command             : WORD ;

arg                 : TEXT ;

/*
* Lexer Rules
*/

fragment LOWERCASE  : [a-z] ;
fragment UPPERCASE  : [A-Z] ;
fragment DIGIT      : [0-9] ;

NUMBER          : DIGIT+ ([.,] DIGIT+)? ;

CMD             : 'CMD';

COLON           : ':' ;

COMMENT         : '--' ~[
]*;

WHITESPACE      : (' ' | '	') ;

NEWLINE         : ('
'? '
' | '
')+;

WORD            : (LOWERCASE | UPPERCASE | NUMBER | '_')+ ;

TEXT            : ~[
]* ;

This is a test for my grammar:

$antlr4 MyGrammar.g4

warning(146): MyGrammar.g4:45:0: non-fragment lexer rule TEXT can match the empty string

$javac MyGrammar*.java

$grun MyGrammar root -tokens

CMD.NEW

[@0,0:6='CMD.NEW',,1:0]

[@1,7:7=' ',,1:7]

[@2,8:7='',,2:0]

The problem is that "CMD.NEW" gets swallowed by TEXT, because that rule is greedy.

Anyone can help me with this? Thanks

Pavel Smirnov · Accepted Answer

There is a grammar ambiguity.

In the example you have provided CMD.NEW can match both command_line and normal_line.
Thus, given the expression:

 line                : (comment_line | command_line | normal_line) NEWLINE;

the parser can not definitely say what rule to accept (command_line or normal_line), so it matches it to normal_line which is actually a simple TEXT.

Consider rewriting your grammar in the way the parser can always say what rule to accept.

UPDATE:

Try this (I did not test that, but it should work):

grammar MyGrammar;

/*
* Parser Rules
*/

root                : line+ EOF ;

line                : (comment_line | command_line) NEWLINE;

comment_line        : COMMENT ;

command_line        : CMD '.' (note_cmd | command);

command             : command_name ((COLON WHITESPACE*)? arg)? ;

note_cmd            : notes .*? (CMD '.' END_NOTES) ;

command_name             : WORD ;

arg                 : TEXT ;

/*
* Lexer Rules
*/

fragment LOWERCASE  : [a-z] ;
fragment UPPERCASE  : [A-Z] ;
fragment DIGIT      : [0-9] ;

NUMBER          : DIGIT+ ([.,] DIGIT+)? ;

CMD             : 'CMD';

COLON           : ':' ;

COMMENT         : '--' ~[
]*;

WHITESPACE      : (' ' | '	') ;

NEWLINE         : ('
'? '
' | '
')+;

WORD            : (LOWERCASE | UPPERCASE | NUMBER | '_')+ ;

NOTES            : 'NOTES';

END_NOTES        : 'END_NOTES';

TEXT            : ~[
]* ;

ANTLR : issue with greedy rule

Answers (1)

Related Questions