Reputation: 301
Let's say I have this grammar, written with ANTLR4:
grammar Test;
start : expr* ;
expr : expr '-' expr
| INT ;
MINUS : '-' ;
INT: MINUS? DIGIT+ ; // Disclaimer: this definition of an integer is just for illustration purposes
DIGIT : '0'..'9' ;
WS : [ \t\r\n]+ -> skip ; // skip spaces, tabs, newlines
My thinking is that 1-1 should be the same as 1 - 1, and both should match expr '-' expr. In the case of 1 - 1:
start
expr(-)
expr(1) expr(1)
The tree above seems correct: it corresponds to expr '-' expr.
But without the spaces, ANTLR thinks there are two INT expressions. In the case of 1-1:
start
expr(1) expr(-1)
Shouldn't all whitespace be skipped (by the WS rule), so that both expressions are parsed the same way?
Upvotes: 1
Views: 109
Reputation: 170138
Lexer rules match as many characters as possible, so - 1 is tokenised as a MINUS and an INT, while -1 (without the space) is tokenised as a single INT.
You must realise that the lexer does not listen to the parser. Even if the parser tries to match the tokens INT MINUS INT for the input 1-1, the lexer does not produce these tokens. Because the lexer matches as many characters as possible, it will always create two INT tokens for that input (no MINUS!). Tokenisation and parsing are two separate steps.
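To make the "as many characters as possible" rule concrete, here is a rough Python sketch that imitates it for the four lexer rules in the question. It is not how ANTLR is implemented; the regexes, the RULES table and the tokenize function are my own stand-ins. It only assumes that the longest match wins and that, on a tie, the rule defined first wins.

```python
import re

# Stand-ins for the lexer rules from the question, in grammar order.
# Each entry is (rule name, regex approximating the rule, skip this token?).
RULES = [
    ("MINUS", re.compile(r"-"), False),
    ("INT", re.compile(r"-?[0-9]+"), False),
    ("DIGIT", re.compile(r"[0-9]"), False),
    ("WS", re.compile(r"[ \t\r\n]+"), True),
]


def tokenize(text):
    """Split text into (rule, lexeme) pairs using longest-match-wins."""
    tokens = []
    pos = 0
    while pos < len(text):
        best = None  # (match length, rule name, skip flag)
        for name, pattern, skip in RULES:
            m = pattern.match(text, pos)
            if m is None or m.end() == pos:
                continue
            length = m.end() - pos
            # Only a strictly longer match replaces the current best,
            # so on a tie the rule listed first is kept.
            if best is None or length > best[0]:
                best = (length, name, skip)
        if best is None:
            raise ValueError(f"no rule matches at position {pos}")
        length, name, skip = best
        if not skip:
            tokens.append((name, text[pos:pos + length]))
        pos += length
    return tokens


print(tokenize("1 - 1"))  # [('INT', '1'), ('MINUS', '-'), ('INT', '1')]
print(tokenize("1-1"))    # [('INT', '1'), ('INT', '-1')]
```

With the spaces, the - is followed by a space, so INT cannot start there and MINUS is the only match. Without them, -1 is a longer match than - alone, so INT wins, and the parser never sees a MINUS token.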
Upvotes: 1