Peter Hall

Error while parsing a date with Antlr4

I'm trying to parse dates using the following grammar:

grammar Dates;

formattedDate : (DATE '/' MONTH '/' year);
year : SHORT_YEAR | FULL_YEAR;

SHORT_YEAR : DIGIT DIGIT;
FULL_YEAR : ('19' | '20' | '21') DIGIT DIGIT;
DATE : (('0'..'2')? DIGIT) | '30' | '31';
MONTH : ('0'? DIGIT) | '11' | '12';

fragment DIGIT : ('0' .. '9');

But it fails to parse the values that I would expect to work. For example, an input of 11/04/2017 produces the error:

   line 1:0 mismatched input '11' expecting DATE

My first guess was that, for some values (1-12), the lexer can't decide whether the input is a DATE or a MONTH, and that this ambiguity is causing the problem. But when I tried to fix it by replacing the tokens with parser rules, I had the same problem:

formattedDate : (dateNum '/' monthNum '/' year);

year : shortYear | fullYear;
shortYear : DIGIT DIGIT;
fullYear : ('19' | '20' | '21') DIGIT DIGIT;
dateNum : (('0'..'2')? DIGIT) | '30' | '31';
monthNum : ('0'? DIGIT) | '11' | '12';

fragment DIGIT : ('0' .. '9');

And it still seems to struggle on the first value, even if it is something like 31, which is outside the ambiguous range.

What am I doing wrong here?

Answers (1)

KinGamer

As you say, the tokens overlap (note that 31 is ambiguous too: it could also be a SHORT_YEAR). In cases like this, the lexer chooses the rule that matches the longest possible input. If two or more rules match the same length, it chooses the first one in the order they appear in the grammar. The lexer makes this choice without knowing which token the parser expects. (I think I read this some time ago on www.antlr.org.)

So just changing the order of the rules "solves" the problem, or rather moves it further along the input (note that DATE now comes before SHORT_YEAR and MONTH):

grammar Dates;

formattedDate : (DATE '/' MONTH '/' year);
year : SHORT_YEAR | FULL_YEAR;

DATE : (('0'..'2')? DIGIT) | '30' | '31';
SHORT_YEAR : DIGIT DIGIT;
FULL_YEAR : ('19' | '20' | '21') DIGIT DIGIT;
MONTH : ('0'? DIGIT) | '11' | '12';

fragment DIGIT : ('0' .. '9');

This now yields line 1:3 mismatched input '04' expecting MONTH, because '04' also matches DATE, which comes first.
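The longest-match and first-rule tie-break behaviour described above can be traced with a small simulation. This is only a sketch in Python, not how ANTLR is implemented: the regexes are hand-written stand-ins for the lexer rules, and `SLASH`, `longest_prefix` and `tokenize` are hypothetical names introduced for illustration.

```python
import re

# Regex stand-ins for the lexer rules in the question. This is an
# illustrative assumption: ANTLR builds its own automaton and does not
# use Python regexes.
RULES = {
    "SLASH": r"/",  # hypothetical name for the implicit '/' token
    "SHORT_YEAR": r"[0-9][0-9]",
    "FULL_YEAR": r"(?:19|20|21)[0-9][0-9]",
    "DATE": r"[0-2]?[0-9]|30|31",
    "MONTH": r"0?[0-9]|11|12",
}


def longest_prefix(pattern, text, pos):
    """Length of the longest prefix of text[pos:] fully matched by pattern."""
    for end in range(len(text), pos, -1):
        if re.fullmatch(pattern, text[pos:end]):
            return end - pos
    return 0


def tokenize(text, rule_order):
    """Pick the longest match at each position; ties go to the earlier rule."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name in rule_order:
            length = longest_prefix(RULES[name], text, pos)
            if length > best_len:  # strictly longer wins, so ties keep the first rule
                best_name, best_len = name, length
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


# Rule order from the question, then the reordered version from this answer.
original = ["SLASH", "SHORT_YEAR", "FULL_YEAR", "DATE", "MONTH"]
reordered = ["SLASH", "DATE", "SHORT_YEAR", "FULL_YEAR", "MONTH"]

print(tokenize("11/04/2017", original))
print(tokenize("11/04/2017", reordered))
```

With the original order, '11' comes out as SHORT_YEAR, which matches the first error message. With DATE moved to the top, '11' becomes DATE but so does '04', which matches the second error. In neither ordering can MONTH win a tie, since every one- or two-digit MONTH value is also matched by an earlier rule.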


A possible solution is to use lexer modes, so that only the rules relevant to the current position in the date are active:

DatesLexer.g4:

lexer grammar DatesLexer;

// Mode expecting DATE (default mode)
DATE : (('0'..'2')? DIGIT) | '30' | '31';
DATE_BAR : '/'
    -> pushMode(readingMonth);

// Mode expecting MONTH
mode readingMonth;
MONTH : ('0'? DIGIT) | '11' | '12';
MONTH_BAR : '/'
    -> popMode, pushMode(readingYear);

// Mode expecting *_YEAR
mode readingYear;
SHORT_YEAR : DIGIT DIGIT
    -> popMode;
FULL_YEAR : ('19' | '20' | '21') DIGIT DIGIT
    -> popMode;

fragment DIGIT : ('0' .. '9');

DatesParser.g4:

parser grammar DatesParser;

options { tokenVocab=DatesLexer; }

formattedDate : (DATE  DATE_BAR  MONTH  MONTH_BAR  year);
year : SHORT_YEAR | FULL_YEAR;

Result:

[Image: a parse tree showing that the date 11/04/2017 was successfully parsed]

For reference, the commands used and the resulting token stream:

> antlr4 DatesLexer.g4 [-o outDir]
> antlr4 DatesParser.g4 [-o outDir]
> [cd outDir]
> javac *.java
> grun Dates formattedDate -tokens <file> [-gui]
[@0,0:1='11',<1>,1:0]
[@1,2:2='/',<2>,1:2]
[@2,3:4='04',<3>,1:3]
[@3,5:5='/',<4>,1:5]
[@4,6:9='2017',<6>,1:6]
[@5,10:9='<EOF>',<-1>,1:10]
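
To see why the modes remove the ambiguity, the mode stack can be emulated by hand. The following is a rough Python sketch under the same assumptions as before (regexes standing in for the lexer rules); the `MODES` table and the `tokenize_with_modes` helper are hypothetical names, not part of ANTLR.

```python
import re

# Per-mode rule tables mirroring DatesLexer.g4: (token name, regex, actions).
# "DEFAULT" is a label chosen here for the default mode.
MODES = {
    "DEFAULT": [
        ("DATE", r"[0-2]?[0-9]|30|31", []),
        ("DATE_BAR", r"/", [("push", "readingMonth")]),
    ],
    "readingMonth": [
        ("MONTH", r"0?[0-9]|11|12", []),
        ("MONTH_BAR", r"/", [("pop", None), ("push", "readingYear")]),
    ],
    "readingYear": [
        ("SHORT_YEAR", r"[0-9][0-9]", [("pop", None)]),
        ("FULL_YEAR", r"(?:19|20|21)[0-9][0-9]", [("pop", None)]),
    ],
}


def longest_prefix(pattern, text, pos):
    """Length of the longest prefix of text[pos:] fully matched by pattern."""
    for end in range(len(text), pos, -1):
        if re.fullmatch(pattern, text[pos:end]):
            return end - pos
    return 0


def tokenize_with_modes(text):
    """Tokenize using only the rules of the mode on top of the stack."""
    stack = ["DEFAULT"]  # the last element is the current mode
    tokens = []
    pos = 0
    while pos < len(text):
        best = None
        best_len = 0
        for name, pattern, actions in MODES[stack[-1]]:
            length = longest_prefix(pattern, text, pos)
            if length > best_len:  # longest match; ties keep the earlier rule
                best, best_len = (name, actions), length
        if best is None:
            raise ValueError(f"no rule matches at offset {pos} in mode {stack[-1]}")
        name, actions = best
        tokens.append((name, text[pos:pos + best_len]))
        for kind, target in actions:
            if kind == "push":
                stack.append(target)
            else:  # "pop"
                stack.pop()
        pos += best_len
    return tokens


print(tokenize_with_modes("11/04/2017"))
```

Because only DATE and DATE_BAR are visible in the default mode, '11' can no longer be taken for a year, and '04' is only ever compared against MONTH. The token names produced by the sketch line up with the types in the dump above (<1> DATE, <2> DATE_BAR, <3> MONTH, <4> MONTH_BAR, <6> FULL_YEAR).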
