Reputation: 12092
The following extremely simple example grammar doesn't lex as I expected (at all).
Declaration : 'VAR';
Letter: ('A'..'Z');
message : Declaration Letter+;
what I expected as a result is that any sequence of letters would lex as individual letters, and the sequence 'VAR' would be lexed as a single token.
When I look at the ANTLRWorks interperter I see the following results:
VARA
parses into message -> "VAR", "A"
(expected)VARVA
doesn't parse (MismatchedTokenException(-1 != 5). The lexer hits the second VA
and tries to tokenise Declaration
. Expected: message -> "VAR", "V", "A"
VARVPP
parses into message -> "VAR", "V", "P", "P"
(expected)VARVALL
parses into message -> "VAR", "VALL"
. I would like some help understanding this behaviour, and a suggestion how I can fix this.
Specifically:
VA
into Declaration if it is followed by one letter?V
?Upvotes: 4
Views: 197
Reputation: 26868
The lexer doesn't really perform lookahead, only the parser does; you can read more about it in ANTLR lexer can't lookahead at all. So the problem here is that, once the lexer fails to match VAR
, it tries to match what it got so far - VA
- and there's no matching token, since Letter cannot match two characters, just one.
As for a solution, a trivial one is to change it to be a single token:
Message : 'VAR' ('A'..'Z')+;
message : Message;
It won't net you a different token for each letter, though.
Upvotes: 2
Reputation: 170148
Let's go through all of your 4 examples:
All okay.
"VAR"
is (obviously) tokenized as VAR
, but then the lexer "sees" "VA"
and expects an "R"
, which is not there. It emits the following errors:
line 1:5 mismatched character '<EOF>' expecting 'R'
line 1:5 required (...)+ loop did not match anything at input '<EOF>'
and discards the "VA"
resulting in a single token to be created, as you can see when running ANTLRWorks' debugger (ignore the exceptions in the parse, they're not actually there :)):
The thing you must realize is that the lexer will never give up on something it has already matched. So if the lexer sees "VA"
and cannot match an "R"
after it, it will then look at the other lexer rules that can match "VA"
. But Letter
does not match that (it only matches single letters!) If you change Letter
to match more than a single character, ANTLR would be able to fall back on that rule. But not when it matches a single letter: the lexer will not give up the "A"
from "VA"
to let the Letter
rule match. No way around it: this is how ANTLR's lexer works.
This is usually not an issue because there is often some sort of IDENTIFIER
rule that the lexer can fall back on when a keyword cannot be matched.
All okay: "VAR"
becomes a VAR
and then the lexer tries to match an "A"
after the "V"
but this does not happen, so the lexer falls back on the Letter
rule for the single "V"
. After that "PP"
are both tokenized as Letter
s.
"VAR"
again becomes a VAR
. Then the "L"
in "VAL"
causes the lexer to produce the following error message:
line 1:5 mismatched character 'L' expecting 'R'
and then the last "L"
becomes a Letter
:
I guess (or hope) the first 3 question are now answered, which leaves your final answer:
How should I change this grammar to parse the way I expected?
By forcing the lexer to first look ahead in the character stream if there really is "VAR"
ahead, and if there's not, just match a single "V"
and change the type of the matched token to Letter
, like this:
Declaration
: ('VAR')=> 'VAR'
| 'V' {$type=Letter;}
;
As mentioned before my answer, see this related Q&A: ANTLR lexer can't lookahead at all
Upvotes: 5