Reputation: 10708
I'm trying to match either W or W* in a grammar. Here's a stripped-down version:
grammar PdfStream;
content : stat* ;
stat
: wCap ;
wCap: 'W' '*'? ; // Set clipping path using nonzero winding ('W') or even-odd ('W*') rule
// Accept all characters except whitespace and defined delimiters ()<>[]{}/%
ID: ~[ \t\r\n\u000C\u0000()<>[\]{}/%]+ ;
WS: [ \t\r\n\u000C\u0000]+ -> skip ; // PDF defines six whitespace characters
When I try it with this input:
W W* W
my visitor gets the first W but nothing after that.
But if I use this definition instead:
wCap: ('W'|'W*') ;
the visitor gets all three statements.
That's a workable solution but I'm curious as to why my first definition doesn't work.
Upvotes: 1
Views: 186
Reputation: 3506
ANTLR lexer rules prefer the longest match and then break ties by taking the rule defined first.
To make things clearer, let's move the implicit tokens 'W' and '*' into named lexer rules, because this is what ANTLR does internally. I'll call them DUB_U and STAR to avoid confusion between the rules and the characters they match. Hence your lexer is really:
DUB_U: 'W' ;
STAR: '*' ;
ID: ~[ \t\r\n\u000C\u0000()<>[\]{}/%]+ ;
WS: [ \t\r\n\u000C\u0000]+ -> skip ;
When matching W W* W you get the token stream:
DUB_U(W) WS( ) ID(W*) WS( ) DUB_U(W)
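This is because W* matches ID, and that match is longer than the single W matched by DUB_U. (The WS matches are shown only for clarity; because of -> skip they never reach the parser.) The parser then sees an ID where wCap expects 'W', so stat* stops after the first statement, which is why your visitor only gets the first W.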
When you change the rule to wCap: ('W'|'W*') ; your new effective lexer is:
DUB_U: 'W' ;
DUB_U_STAR: 'W*' ;
ID: ~[ \t\r\n\u000C\u0000()<>[\]{}/%]+ ;
WS: [ \t\r\n\u000C\u0000]+ -> skip ;
When matching W W* W you get the token stream:
DUB_U(W) WS( ) DUB_U_STAR(W*) WS( ) DUB_U(W)
This is because W* matches both DUB_U_STAR and ID, but DUB_U_STAR is defined first and the tiebreaker makes a DUB_U_STAR token.
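To see the two rules (longest match, then first-defined) in action without running ANTLR, here is a minimal Python sketch that imitates them. It is not ANTLR's implementation: Python's re module stands in for the lexer, and the names tokenize, ID_PATTERN, WS_PATTERN, first and second are made up for this illustration. The rule lists mirror the two effective lexers above.

```python
import re

# Character classes transcribed from the grammar's ID and WS rules.
ID_PATTERN = r"[^ \t\r\n\x0c\x00()<>\[\]{}/%]+"
WS_PATTERN = r"[ \t\r\n\x0c\x00]+"


def tokenize(rules, text):
    """Toy lexer: the longest match wins, the earliest rule breaks ties."""
    compiled = [(name, re.compile(pattern)) for name, pattern in rules]
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, regex in compiled:
            m = regex.match(text, pos)
            # Strict '>' keeps the earlier rule when two matches are equally long.
            if m and m.end() - pos > best_len:
                best_name, best_len = name, m.end() - pos
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        if best_name != "WS":  # imitate '-> skip'
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


# Effective lexer for: wCap: 'W' '*'? ;
first = [("DUB_U", r"W"), ("STAR", r"\*"), ("ID", ID_PATTERN), ("WS", WS_PATTERN)]
# Effective lexer for: wCap: ('W'|'W*') ;
second = [("DUB_U", r"W"), ("DUB_U_STAR", r"W\*"), ("ID", ID_PATTERN), ("WS", WS_PATTERN)]

print(tokenize(first, "W W* W"))
print(tokenize(second, "W W* W"))
```

With the first rule list the middle token comes out as ID, because ID's two-character match beats DUB_U's one-character match. With the second it comes out as DUB_U_STAR, because DUB_U_STAR and ID tie on length and DUB_U_STAR is listed first.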
Upvotes: 2