Reputation: 58785
I'm trying to parse dates using the following grammar:
grammar Dates;
formattedDate : (DATE '/' MONTH '/' year);
year : SHORT_YEAR | FULL_YEAR;
SHORT_YEAR : DIGIT DIGIT;
FULL_YEAR : ('19' | '20' | '21') DIGIT DIGIT;
DATE : (('0'..'2')? DIGIT) | '30' | '31';
MONTH : ('0'? DIGIT) | '11' | '12';
fragment DIGIT : ('0' .. '9');
But it fails to parse values that I would expect to work. For example, an input of 11/04/2017 produces the error:
line 1:0 mismatched input '11' expecting DATE
My first guess was that for some values (1-12) the lexer can't decide whether the input is a DATE or a MONTH, which is causing the problem. But when I tried to fix it by replacing the lexer rules with parser rules, I had the same problem:
formattedDate : (dateNum '/' monthNum '/' year);
year : shortYear | fullYear;
shortYear : DIGIT DIGIT;
fullYear : ('19' | '20' | '21') DIGIT DIGIT;
dateNum : (('0'..'2')? DIGIT) | '30' | '31';
monthNum : ('0'? DIGIT) | '11' | '12';
fragment DIGIT : ('0' .. '9');
And it still seems to struggle on the first value, even if it is something like 31, which is outside the range of ambiguity.
What am I doing wrong here?
Upvotes: 1
Views: 93
Reputation: 509
As you say, "the tokens overlap" (note that 31 is ambiguous too: it could be a short year). The lexer tokenizes the input on its own, without knowing which token the parser expects next. In cases like this, it chooses the lexer rule that matches the longest stretch of input. If two or more rules match the same length, it chooses the first one in the order they appear in the grammar. (I think I read this some time ago on www.antlr.org.)
So just changing the order of the rules "solves" the problem, or rather pushes it further along (note that DATE now comes before SHORT_YEAR and MONTH):
grammar Dates;
formattedDate : (DATE '/' MONTH '/' year);
year : SHORT_YEAR | FULL_YEAR;
DATE : (('0'..'2')? DIGIT) | '30' | '31';
SHORT_YEAR : DIGIT DIGIT;
FULL_YEAR : ('19' | '20' | '21') DIGIT DIGIT;
MONTH : ('0'? DIGIT) | '11' | '12';
fragment DIGIT : ('0' .. '9');
This now yields: line 1:3 mismatched input '04' expecting MONTH
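To see both outcomes without running ANTLR, the tie-breaking rule can be sketched in a few lines of Python. This is only a simulation under assumptions: the lexer rules are hand-translated into regular expressions, `tokenize` and `longest_prefix` are hypothetical helpers (not part of ANTLR), and the implicit '/' token is modelled as an ordinary token that I call SLASH.

```python
import re

# Lexer rules hand-translated to regexes (an approximation, not ANTLR itself).
PATTERNS = {
    "SHORT_YEAR": r"\d\d",
    "FULL_YEAR": r"(?:19|20|21)\d\d",
    "DATE": r"[0-2]?\d|30|31",
    "MONTH": r"0?\d|11|12",
    "SLASH": r"/",  # hypothetical name standing in for the implicit '/' token
}

# Rule order as declared in the question and in the reordered grammar.
QUESTION_ORDER = ["SHORT_YEAR", "FULL_YEAR", "DATE", "MONTH", "SLASH"]
ANSWER_ORDER = ["DATE", "SHORT_YEAR", "FULL_YEAR", "MONTH", "SLASH"]


def longest_prefix(pattern, text, pos):
    """Length of the longest prefix of text[pos:] that the pattern fully matches (0 if none)."""
    for end in range(len(text), pos, -1):
        if re.fullmatch(pattern, text[pos:end]):
            return end - pos
    return 0


def tokenize(rule_order, text):
    """Longest match wins; on a tie, the rule declared first wins."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name in rule_order:
            length = longest_prefix(PATTERNS[name], text, pos)
            if length > best_len:  # strict '>' keeps the earlier rule on ties
                best_name, best_len = name, length
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


print(tokenize(QUESTION_ORDER, "11/04/2017"))
print(tokenize(ANSWER_ORDER, "11/04/2017"))
```

With the question's order, '11' comes out as SHORT_YEAR. With DATE moved to the top, '11' becomes DATE but so does '04'. That lines up with the two error messages.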
A possible solution is to use lexer grammar modes:
DatesLexer.g4:
lexer grammar DatesLexer;
// Mode expecting DATE (default mode)
DATE : (('0'..'2')? DIGIT) | '30' | '31';
DATE_BAR : '/' -> pushMode(readingMonth);
// Mode expecting MONTH
mode readingMonth;
MONTH : ('0'? DIGIT) | '11' | '12';
MONTH_BAR : '/' -> popMode, pushMode(readingYear);
// Mode expecting *_YEAR
mode readingYear;
SHORT_YEAR : DIGIT DIGIT -> popMode;
FULL_YEAR : ('19' | '20' | '21') DIGIT DIGIT -> popMode;
fragment DIGIT : ('0' .. '9');
DatesParser.g4:
parser grammar DatesParser;
options { tokenVocab=DatesLexer; }
formattedDate : (DATE DATE_BAR MONTH MONTH_BAR year);
year : SHORT_YEAR | FULL_YEAR;
Result (the commands are included only for reference):
> antlr4 DatesLexer.g4 [-o outDir]
> antlr4 DatesParser.g4 [-o outDir]
> [cd outDir]
> javac *.java
> grun Dates formattedDate -tokens <file> [-gui]
[@0,0:1='11',<1>,1:0]
[@1,2:2='/',<2>,1:2]
[@2,3:4='04',<3>,1:3]
[@3,5:5='/',<4>,1:5]
[@4,6:9='2017',<6>,1:6]
[@5,10:9='<EOF>',<-1>,1:10]
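To follow the mode switches by hand, here is a rough Python trace of the modal lexer above. It is a sketch, not ANTLR: the rules are hand-translated to regular expressions, `tokenize_with_modes` and `longest_prefix` are hypothetical helpers, and pushMode/popMode are modelled as saving the current mode on a stack and restoring the most recently saved one.

```python
import re

# (token name, regex, lexer commands) per mode, in declaration order.
# Regexes are a hand translation of the lexer grammar, not generated by ANTLR.
MODES = {
    "DEFAULT_MODE": [
        ("DATE", r"[0-2]?\d|30|31", []),
        ("DATE_BAR", r"/", [("push", "readingMonth")]),
    ],
    "readingMonth": [
        ("MONTH", r"0?\d|11|12", []),
        ("MONTH_BAR", r"/", [("pop",), ("push", "readingYear")]),
    ],
    "readingYear": [
        ("SHORT_YEAR", r"\d\d", [("pop",)]),
        ("FULL_YEAR", r"(?:19|20|21)\d\d", [("pop",)]),
    ],
}


def longest_prefix(pattern, text, pos):
    """Length of the longest prefix of text[pos:] that the pattern fully matches (0 if none)."""
    for end in range(len(text), pos, -1):
        if re.fullmatch(pattern, text[pos:end]):
            return end - pos
    return 0


def tokenize_with_modes(text):
    """Only the rules of the current mode compete for each token."""
    mode, stack, tokens, pos = "DEFAULT_MODE", [], [], 0
    while pos < len(text):
        best = None  # (name, length, commands) of the winning rule so far
        for name, pattern, commands in MODES[mode]:
            length = longest_prefix(pattern, text, pos)
            if length > 0 and (best is None or length > best[1]):
                best = (name, length, commands)
        if best is None:
            raise ValueError(f"no rule matches at position {pos} in mode {mode}")
        name, length, commands = best
        tokens.append((name, text[pos:pos + length]))
        pos += length
        for command in commands:
            if command[0] == "push":
                stack.append(mode)  # remember the mode we are leaving
                mode = command[1]
            else:
                mode = stack.pop()  # return to the most recently saved mode
    return tokens


print(tokenize_with_modes("11/04/2017"))
```

Because DATE, MONTH and the year rules never sit in the same mode, they no longer compete for the same text, and the sequence of token names matches the token dump above.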
Upvotes: 1