Reputation: 101

Overlapping Tokens in ANTLR 4

I have the following ANTLR 4 combined grammar:

grammar Example;

fieldList:  field* ;

field:      'field' identifier '{' note '}' ;

note:       NOTE ;
identifier: IDENTIFIER ;

NOTE:       [A-Ga-g] ;
IDENTIFIER: [A-Za-z0-9]+ ;
WS:         [ \t\r\n]+ -> skip ;

This parses:

field x { A }
field x { B }

This does not:

field a { A }
field b { B }

In the case where parsing fails, I think the lexer is getting confused and emitting a NOTE token where I want it to emit an IDENTIFIER token.

Edit:

In the tokens coming out of the lexer, a 'NOTE' token shows up where the parser expects 'IDENTIFIER'. Both rules match the single character equally well, and 'NOTE' wins the tie because it is defined first in the grammar. I can think of two ways to fix this. First, I could alter the grammar to disambiguate 'NOTE' and 'IDENTIFIER' (for example, by requiring a '$' in front of 'NOTE'). Second, I could use 'IDENTIFIER' wherever I would use 'NOTE' and detect invalid notes when I walk the parse tree. Neither of those feels optimal. Surely there must be a better way to fix this?
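To make the tie-breaking concrete, here is a minimal Python sketch that imitates how the lexer chooses a token: the longest match wins, and on equal length the rule listed first wins. It is only a simulation, not ANTLR itself. The names RULES and tokenize are hypothetical, and the implicit literals ('field', '{', '}') are listed ahead of the named rules on the assumption that a combined grammar gives them priority.

```python
import re

# Lexer rules in priority order. Literals used in parser rules are assumed
# to come before the explicitly named lexer rules.
RULES = [
    ("'field'", re.compile(r"field")),
    ("'{'", re.compile(r"\{")),
    ("'}'", re.compile(r"\}")),
    ("NOTE", re.compile(r"[A-Ga-g]")),
    ("IDENTIFIER", re.compile(r"[A-Za-z0-9]+")),
    ("WS", re.compile(r"[ \t\r\n]+")),
]

def tokenize(text):
    """Return (type, text) pairs using longest match, first rule wins ties."""
    pos, tokens = 0, []
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so an earlier rule keeps the win when lengths are equal.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        if best_name != "WS":  # WS is skipped, as in the grammar
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

print(tokenize("field x { A }"))  # x is outside [A-Ga-g], so only IDENTIFIER matches
print(tokenize("field a { A }"))  # a matches both rules with length 1, NOTE wins
print(tokenize("field ab { A }")) # ab is longer than any NOTE match, IDENTIFIER wins
```

Under these assumptions, only single-character names in the range a-g (or A-G) are affected; anything longer or outside that range still comes out as IDENTIFIER.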

Upvotes: 5

Answers (2)

Michael Quigley

Reputation: 101

I actually ended up solving it like this:

grammar Example;

fieldList:  field* ;

field:      'field' identifier '{' note '}' ;

note:       NOTE ;
identifier: IDENTIFIER | NOTE ;

NOTE:       [A-Ga-g] ;
IDENTIFIER: [A-Za-z0-9]+ ;
WS:         [ \t\r\n]+ -> skip ;

My parse tree still ends up looking how I'd like.

The actual grammar I'm developing is more complicated, as is the workaround based on this approach. But in general, the approach seems to work well.
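As a rough illustration of why this works, the Python sketch below checks a sequence of token types against the field rule, once with identifier accepting only IDENTIFIER and once with identifier accepting IDENTIFIER or NOTE. The token type lists are written by hand to mirror what the lexer emits for each input, and the names matches_field, ORIGINAL and WORKAROUND are hypothetical; this is not ANTLR's parsing algorithm.

```python
def matches_field(token_types, identifier_types):
    """Check a token-type sequence against: 'field' identifier '{' note '}'."""
    if len(token_types) != 5:
        return False
    keyword, ident, lbrace, note, rbrace = token_types
    return (
        keyword == "'field'"
        and ident in identifier_types  # the only part the workaround changes
        and lbrace == "'{'"
        and note == "NOTE"             # note still accepts NOTE only
        and rbrace == "'}'"
    )

# Token types for: field a { A }  (the lexer emits NOTE for "a")
tokens_a = ["'field'", "NOTE", "'{'", "NOTE", "'}'"]
# Token types for: field x { A }
tokens_x = ["'field'", "IDENTIFIER", "'{'", "NOTE", "'}'"]

ORIGINAL = {"IDENTIFIER"}             # identifier: IDENTIFIER ;
WORKAROUND = {"IDENTIFIER", "NOTE"}   # identifier: IDENTIFIER | NOTE ;

print(matches_field(tokens_a, ORIGINAL))    # False
print(matches_field(tokens_a, WORKAROUND))  # True
print(matches_field(tokens_x, WORKAROUND))  # True
```

The lexer still labels "a" as NOTE; the parser rule simply accepts either token type in the identifier position, while the note position remains restricted to NOTE.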

Upvotes: 5

Cv4

Reputation: 162

A quick and dirty fix for your problem: change IDENTIFIER to match only the complement of NOTE, that is, a single character that NOTE does not match. Then combine the two in the identifier rule.

Resulting grammar:

grammar Example;

fieldList:  field* ;

field:      'field' identifier '{' note '}' ;

note:       NOTE ;
identifier: (NOTE|IDENTIFIER_C)+ ;

NOTE:       [A-Ga-g] ;
IDENTIFIER_C: [H-Zh-z0-9] ;
WS:         [ \t\r\n]+ -> skip ;

The disadvantage of this solution is that identifiers no longer arrive as single tokens: the lexer emits one token per character.
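The per-character effect can be seen with a small Python simulation of the two character rules plus WS. This is a sketch under stated assumptions rather than ANTLR output: the names RULES and token_types are hypothetical, and the literals 'field', '{' and '}' are left out because only identifier text is lexed here.

```python
import re

# The character classes are disjoint, so the first matching rule is the only one.
RULES = [
    ("NOTE", re.compile(r"[A-Ga-g]")),
    ("IDENTIFIER_C", re.compile(r"[H-Zh-z0-9]")),
    ("WS", re.compile(r"[ \t\r\n]+")),
]

def token_types(text):
    """Return the token types emitted for text, with WS skipped."""
    pos, out = 0, []
    while pos < len(text):
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            if m:
                break
        else:
            raise ValueError(f"no rule matches at position {pos}")
        if name != "WS":
            out.append(name)
        pos = m.end()  # m.end() is an absolute index into text
    return out

print(token_types("ax1"))                       # one token per character
print(token_types("ab") == token_types("a b"))  # True: WS leaves no trace
```

One consequence worth checking in a real grammar: because WS is skipped, the token stream for "a b" is the same as for "ab" in this simulation, so a rule like (NOTE|IDENTIFIER_C)+ would not be able to tell one identifier from two adjacent ones.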

Upvotes: 1
