Ramg

Reputation: 160

ANTLR4: mismatched input

I am new to ANTLR. I want to write a grammar to parse the following input:

commit a1b2c3d4

The grammar is given below:

grammar commit;

file : 'commit' COMMITHASH NEWLINE;

COMMITHASH : [a-z0-9]+;
DATE       : ~[\r\n]+;
NEWLINE    : '\r'?'\n';

When I try to parse the above input with this grammar, I get the following error:

line 1:0 mismatched input 'commit a1b2c3d4' expecting 'commit'

Note: I added the DATE token intentionally. Without it, the grammar works fine, but I would like to know what happens when the DATE token is added.

I have read the related question "Antlr4: Mismatched input" but am still not clear about what happened.

Upvotes: 0

Views: 2527

Answers (1)

Sam Harwell

Reputation: 99869

ANTLR lexers fully assign unambiguous token types before the parser is ever used. When one lexer rule can match more characters than another lexer rule, the rule matching more characters is always preferred by ANTLR, regardless of the order in which the lexer rules appear in the grammar. When two or more rules match exactly the same length of input symbols (and no other rule matches more than this number of input symbols), a token type is assigned for the rule that appears first in the grammar.

Your lexer contains a rule DATE that matches any sequence of one or more characters other than \r or \n. Starting at the beginning of a line, it therefore always matches the entire text of that line. Since none of your tokens span multiple lines, the result is the following:

  • If the entire text of a single line matches commit, an unnamed token corresponding to this input sequence will be produced.
  • If the entire text of a single line matches [a-z0-9]+, a COMMITHASH token will be created for the entire text of the line. DATE also matches this input, but COMMITHASH appears first so it is used.
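  • Otherwise, if the single line contains at least one character, a DATE token will be created for the entire text of the line. Even if the line starts with commit or a COMMITHASH, the DATE rule will be used because it matches a longer sequence of characters. This is what happens with your input: the lexer produces a single DATE token with the text commit a1b2c3d4, which is the mismatched input the parser reports where it expects 'commit'.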
  • Finally, a NEWLINE token will be created for each newline.
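
As a rough illustration of the cases above, the longest-match policy can be mimicked in a few lines of Python. This is only a sketch and not how ANTLR is implemented: the names tokenize, RULES and COMMIT_LITERAL are made up for the example, the regular expressions are hand translations of the lexer rules in the question, and unmatched characters are silently skipped, whereas ANTLR would report a token recognition error before continuing.

```python
import re

# Lexer rules in grammar order. COMMIT_LITERAL is a hypothetical name for the
# implicit token that ANTLR creates for the literal 'commit' in the parser rule.
RULES = [
    ("COMMIT_LITERAL", re.compile(r"commit")),
    ("COMMITHASH", re.compile(r"[a-z0-9]+")),
    ("DATE", re.compile(r"[^\r\n]+")),
    ("NEWLINE", re.compile(r"\r?\n")),
]


def tokenize(text, rules=RULES):
    """Split text into (rule name, matched text) pairs using longest match."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_end = None, pos
        for name, pattern in rules:
            m = pattern.match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so on a tie the rule listed first is kept.
            if m and m.end() > best_end:
                best_name, best_end = name, m.end()
        if best_name is None:
            pos += 1  # no rule matches here; this sketch just skips the character
            continue
        tokens.append((best_name, text[pos:best_end]))
        pos = best_end
    return tokens


# With DATE present, the whole line becomes one DATE token.
print([name for name, _ in tokenize("commit a1b2c3d4\n")])
# ['DATE', 'NEWLINE']

# Without DATE, the line is split the way the parser rule expects.
no_date = [rule for rule in RULES if rule[0] != "DATE"]
print([name for name, _ in tokenize("commit a1b2c3d4\n", no_date)])
# ['COMMIT_LITERAL', 'COMMITHASH', 'NEWLINE']
```

The tokens are produced without any knowledge of what the parser expects at that point, which is why the order of alternatives in the file rule cannot rescue the input.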

You will need to do one of the following to resolve the problem. The exact strategy depends on the larger problem you are trying to solve.

  • Remove the DATE rule, or rewrite it to match a more specific date format.
  • Use semantic predicates and/or lexer modes to restrict the location(s) in the input where a DATE token might be produced.
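
To see why a more specific DATE rule helps, here is a small Python sketch of longest-match tokenization. It assumes, purely for illustration, that dates look like 2024-01-15; the question does not say what format the dates have. The names tokenize, NARROW_RULES and COMMIT_LITERAL are hypothetical, and the sketch skips characters that no rule matches, whereas ANTLR would report a token recognition error for them.

```python
import re

# Hand-translated lexer rules in grammar order. The DATE pattern is an assumed
# YYYY-MM-DD format, not something taken from the original grammar.
NARROW_RULES = [
    ("COMMIT_LITERAL", r"commit"),       # implicit token for the literal 'commit'
    ("COMMITHASH", r"[a-z0-9]+"),
    ("DATE", r"\d{4}-\d{2}-\d{2}"),
    ("NEWLINE", r"\r?\n"),
]


def tokenize(text, rules):
    """Longest match wins; on a tie the rule listed first wins."""
    tokens, pos = [], 0
    while pos < len(text):
        best = None  # (rule name, end offset) of the best match so far
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            if m and m.end() > pos and (best is None or m.end() > best[1]):
                best = (name, m.end())
        if best is None:
            pos += 1  # nothing matches; skip one character
            continue
        tokens.append((best[0], text[pos:best[1]]))
        pos = best[1]
    return tokens


# The commit line is no longer swallowed by DATE.
print([name for name, _ in tokenize("commit a1b2c3d4\n", NARROW_RULES)])
# ['COMMIT_LITERAL', 'COMMITHASH', 'NEWLINE']

# A date still becomes a DATE token, because DATE matches more characters
# than COMMITHASH, which stops at the first '-'.
print([name for name, _ in tokenize("2024-01-15\n", NARROW_RULES)])
# ['DATE', 'NEWLINE']
```

Lexer modes or semantic predicates are the better fit when the DATE text cannot be described by a pattern that is distinct from the other tokens.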

Upvotes: 3
