SSteve

Reputation: 10708

How do I specify an optional trailing character in ANTLR4?

I'm trying to match either W or W* in a grammar. Here's a stripped-down version:

grammar PdfStream;

content : stat* ;

stat
     : wCap ;

wCap: 'W' '*'? ; // Set clipping path using nonzero winding ('W') or even-odd ('W*') rule

// Accept all characters except whitespace and defined delimiters ()<>[]{}/%
ID: ~[ \t\r\n\u000C\u0000()<>[\]{}/%]+ ;

WS: [ \t\r\n\u000C\u0000]+ -> skip ; // PDF defines six whitespace characters

When I try it with this input:

W W* W

my visitor gets the first W but nothing after that.

But if I use this definition instead:

wCap: ('W'|'W*') ;

the visitor gets all three statements.

That's a workable solution but I'm curious as to why my first definition doesn't work.

Upvotes: 1

Views: 186

Answers (1)

SpencerPark

Reputation: 3506

The ANTLR lexer prefers the longest match, and breaks ties between equally long matches by taking the rule defined first.

To make things clearer, let's move the implicit tokens 'W' and '*' into named lexer rules, because this is what ANTLR does internally. I'll call them DUB_U and STAR to avoid confusion between the rules and the characters. Your lexer is therefore really:

DUB_U: 'W' ;
STAR: '*' ;
ID: ~[ \t\r\n\u000C\u0000()<>[\]{}/%]+ ;
WS: [ \t\r\n\u000C\u0000]+ -> skip ;

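When matching W W* W you get the token stream below (the WS matches are shown for clarity, although -> skip discards them before the parser sees them):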

DUB_U(W) WS( ) ID(W*) WS( ) DUB_U(W)

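This is because W* matches ID, and that two-character match is longer than the single W that DUB_U can match. (A lone W matches both DUB_U and ID at the same length, so DUB_U wins there as the earlier rule.) wCap expects a DUB_U optionally followed by a STAR, and no parser rule accepts an ID, so parsing cannot continue past the first W. That is why your visitor only sees one statement.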

When you change the rule to wCap: ('W'|'W*') ; your new effective lexer is:

DUB_U: 'W' ;
DUB_U_STAR: 'W*' ;
ID: ~[ \t\r\n\u000C\u0000()<>[\]{}/%]+ ;
WS: [ \t\r\n\u000C\u0000]+ -> skip ;

When matching W W* W you get the token stream:

DUB_U(W) WS( ) DUB_U_STAR(W*) WS( ) DUB_U(W)

This is because W* matches both DUB_U_STAR and ID at the same length, and DUB_U_STAR is defined first, so the tiebreaker produces a DUB_U_STAR token.
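If you want to see the two rules (longest match, then first-defined) play out without running ANTLR, here is a rough Python sketch. It is only a simulation, not how ANTLR is implemented: regular expressions stand in for the lexer rules, and tokenize, ORIGINAL_RULES and ALTERNATIVE_RULES are names I made up for illustration.

```python
import re

# Lexer rules in definition order. Implicit literals from the parser rules
# come first, as described above. Regexes are stand-ins for the ANTLR rules.
ID_PATTERN = r"[^ \t\r\n\x0c\x00()<>\[\]{}/%]+"
WS_PATTERN = r"[ \t\r\n\x0c\x00]+"

ORIGINAL_RULES = [          # wCap: 'W' '*'? ;
    ("DUB_U", r"W"),
    ("STAR", r"\*"),
    ("ID", ID_PATTERN),
    ("WS", WS_PATTERN),
]

ALTERNATIVE_RULES = [       # wCap: ('W'|'W*') ;
    ("DUB_U", r"W"),
    ("DUB_U_STAR", r"W\*"),
    ("ID", ID_PATTERN),
    ("WS", WS_PATTERN),
]


def tokenize(text, rules, skip=("WS",)):
    """Longest match wins; on a tie, the rule listed first wins."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_end = None, pos
        for name, pattern in rules:
            match = re.compile(pattern).match(text, pos)
            # Strictly greater: an earlier rule keeps the win on equal length.
            if match and match.end() > best_end:
                best_name, best_end = name, match.end()
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        if best_name not in skip:
            tokens.append((best_name, text[pos:best_end]))
        pos = best_end
    return tokens


if __name__ == "__main__":
    print(tokenize("W W* W", ORIGINAL_RULES))
    print(tokenize("W W* W", ALTERNATIVE_RULES))
```

With the original rules the middle token comes out as ID, because ID is the only rule that can consume both characters of W*. With the alternative rules DUB_U_STAR and ID tie at two characters and the earlier rule wins.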

Upvotes: 2
