ANTLR4 lexer rules not matching correct block of text

Question

I am trying to understand how ANTLR4 works based on lexer and parser rules but I am missing something in the following example:

I am trying to parse a file and match all arithmetic additions (e.g. 1+2+3). My file contains the following text:

start
4 + 5 + 22 + 1
other text other text test test
test test other text
55 other text
another text 2 + 4 + 255
number 44
end

and I would like to match

4 + 5 + 22 + 1

and

2 + 4 + 255

My grammar is as follows:

grammar Hello;
hi : expr+ EOF;
expr : NUM (PLUS NUM)+;

PLUS : '+' ;
NUM : [0-9]+ ;
SPACE : [\r\n\t ]+ ->skip;
OTHER : [a-z]+ ;

My abstract syntax tree is visualized as follows (image not shown):

Why does the rule 'expr' match the text 'start'? I also get the error "extraneous input 'start' expecting NUM".

If I make the following change to my grammar

OTHER : [a-z]+ ->skip;

the error is gone. In addition, in the image above, the text '55 other text another text' is matched as an expression node in the AST. Why is this happening?

Does all of the above have to do with the way the lexer matches input? I know that the lexer picks the longest match, with ties going to the rule defined first, but how can I change my grammar so that it matches only the additions?

sepp2k · Accepted Answer

Why does the rule 'expr' match the text 'start'?

It doesn't. When a token shows up red in the tree, that indicates an error. The token did not match any of the possible alternatives, so an error was produced and the parser continued with the next token.

In addition, in the image above, the text '55 other text another text' is matched as an expression node in the AST. Why is this happening?

After you skipped the OTHER tokens, your input basically looks like this:

4 + 5 + 22 + 1 55 2 + 4 + 255 44

4 + 5 + 22 + 1 can be parsed as an expression, no problem. After that the parser either expects a + (continuing the expression) or a number (starting a new expression). So when it sees 55, that indicates the start of a new expression. Now it expects a + (because the grammar says that PLUS NUM must appear at least once after the first number in an expression). What it actually gets is the number 2. So it produces an error and ignores that token. Then it sees a +, which is what it expected. And then it continues that way until the 44, which again starts a new expression. Since that isn't followed by a +, that's another error.
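
The walk-through above can be mimicked with a small hand-written parser. This is only a rough sketch in Python, not ANTLR's actual recovery algorithm: it assumes a single recovery rule (drop one unexpected token if the token after it is the expected one), and the names parse, expect and to_tokens as well as the error message wording are made up for illustration.

```python
def to_tokens(stream):
    # Classify each whitespace-separated piece of the already-skipped input.
    return [("PLUS" if piece == "+" else "NUM", piece) for piece in stream.split()]

def parse(tokens):
    """Sketch of  hi : expr+ EOF;  expr : NUM (PLUS NUM)+ ;  with naive recovery."""
    tokens = tokens + [("EOF", "<EOF>")]
    pos = 0
    exprs, errors = [], []

    def expect(kind):
        nonlocal pos
        if tokens[pos][0] == kind:           # the expected token is there: consume it
            pos += 1
            return tokens[pos - 1][1]
        # Assumed recovery rule: delete one token if the next one is what we want.
        if tokens[pos][0] != "EOF" and tokens[pos + 1][0] == kind:
            errors.append(f"extraneous input {tokens[pos][1]!r} expecting {kind}")
            pos += 2
            return tokens[pos - 1][1]
        errors.append(f"missing {kind} at {tokens[pos][1]!r}")
        return None

    while tokens[pos][0] != "EOF":
        if tokens[pos][0] != "NUM":          # e.g. an unskipped OTHER token
            errors.append(f"extraneous input {tokens[pos][1]!r} expecting NUM")
            pos += 1
            continue
        parts = [tokens[pos][1]]             # the first NUM starts a new expr
        pos += 1
        while True:                          # (PLUS NUM)+ : required at least once
            plus = expect("PLUS")
            if plus is None:
                break
            num = expect("NUM")
            if num is None:
                break
            parts += [plus, num]
            if tokens[pos][0] != "PLUS":     # no further '+': this expr is done
                break
        exprs.append(" ".join(parts))
    return exprs, errors

exprs, errors = parse(to_tokens("4 + 5 + 22 + 1 55 2 + 4 + 255 44"))
print(exprs)   # ['4 + 5 + 22 + 1', '55 + 4 + 255', '44']
print(errors)  # ["extraneous input '2' expecting PLUS", "missing PLUS at '<EOF>'"]
```

The second expression starts at 55 and swallows the rest of that addition after the 2 is dropped, which is the kind of unexpected grouping the tree in the question shows.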

Does all of the above have to do with the way the lexer matches input?

Not really. The token sequence for "start 4 + 5" is OTHER NUM PLUS NUM, or just NUM PLUS NUM if you skip the OTHERs. The token sequence for "55 skippedtext 2 + 4" is NUM NUM PLUS NUM. I assume that's exactly what you'd expect.
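
To check these token sequences without running ANTLR, the four lexer rules can be approximated with Python's re module. This is a sketch under assumptions: the helper token_types is hypothetical, and regex alternation takes the first alternative that matches rather than the longest one, which only coincides with ANTLR's behaviour here because no two rules can start on the same character.

```python
import re

# The lexer rules from the question, in grammar order.
RULES = [
    ("PLUS", r"\+"),
    ("NUM", r"[0-9]+"),
    ("SPACE", r"[ \t\r\n]+"),
    ("OTHER", r"[a-z]+"),
]
LEXER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in RULES))

def token_types(text, skipped):
    # Emit the rule name of every match, leaving out rules marked "-> skip".
    return " ".join(
        m.lastgroup for m in LEXER.finditer(text) if m.lastgroup not in skipped
    )

print(token_types("start 4 + 5", {"SPACE"}))                    # OTHER NUM PLUS NUM
print(token_types("start 4 + 5", {"SPACE", "OTHER"}))           # NUM PLUS NUM
print(token_types("55 skippedtext 2 + 4", {"SPACE", "OTHER"}))  # NUM NUM PLUS NUM
```

Skipping OTHER only removes tokens from the stream; it does not tell the parser that the numbers on either side belonged to different lines.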

Instead, what seems to be confusing you is how ANTLR recovers from errors (or maybe the fact that it recovers from errors at all).
