santosh singh

Reputation: 28662

ANTLR Match DateTime in String

I have written the following grammar for parsing a DateTime from a given string:

datetime:  INT SEPARATOR month SEPARATOR INT4 
        | INT SEPARATOR month SEPARATOR INT4;

month:
    JAN
    | FEB
    | MAR
    | APR
    | MAY
    | JUN
    | JUL
    | AUG
    | SEP
    | OCT
    | NOV
    | DEC;


STRING: [a-zA-Z][a-zA-Z]+;

NUMBER: [0-9]+;

INT4: DIGIT DIGIT DIGIT DIGIT;
INT: DIGIT+;
DIGIT: ['0'-'9'];
DQUOTE  : '"';
JAN: [Jj][Aa][Nn];
FEB: [Ff][Ee][Bb];
MAR: [Mm][Aa][Rr];
APR: [Aa][Pp][Rr];
MAY: [Mm][Aa][Yy];
JUN: [Jj][Uu][Nn];
JUL: [Jj][Uu][Ll];
AUG: [Aa][Uu][Gg];
SEP: [Ss][Ee][Pp];
OCT: [Oo][Cc][Tt];
NOV: [Nn][Oo][Vv];
DEC: [Dd][Ee][Cc];

SEPARATOR: '-';
WS: [ \n\t\r]+ -> skip;

When I try to match the following string:

new teatime at 23-SEP-2013 for Santosh Singh and 3 guests

I get the following error in the ANTLR output:

line 1:15 mismatched input '23' expecting INT

Upvotes: 1

Views: 93

Answers (1)

Bart Kiers

Reputation: 170227

First, the rule DIGIT: ['0'-'9']; is incorrect. It should be: DIGIT: [0-9];

Whenever you get unexpected results, start by dumping the tokens your lexer is creating to see if they are the tokens you expect your parser to work with. For your grammar, that would be the following tokens:

STRING                    `new`
STRING                    `teatime`
STRING                    `at`
NUMBER                    `23`
SEPARATOR                 `-`
STRING                    `SEP`
SEPARATOR                 `-`
NUMBER                    `2013`
STRING                    `for`
STRING                    `Santosh`
STRING                    `Singh`
STRING                    `and`
NUMBER                    `3`
STRING                    `guests`

As you can see, there are a couple of things going wrong:

  1. No INT tokens are ever created, although your parser expects them. This is because of the following rules (and their order):
NUMBER : [0-9]+;
INT4   : DIGIT DIGIT DIGIT DIGIT;
INT    : DIGIT+;
DIGIT  : [0-9];

For the input 3, the rules NUMBER, INT and DIGIT all match. ANTLR's lexer first prefers the rule that matches the most characters; when several rules match the same number of characters, the rule defined first "wins". NUMBER is defined before INT4, INT and DIGIT and matches any run of digits, so a single digit, or any longer run of digits, will always become a NUMBER token. INT4, INT and DIGIT tokens will never be created, no matter which of them the parser is trying to match. The lexer works independently of the parser, so the parser's expectations cannot influence which tokens are produced.

  2. The months are never matched; they all become STRING tokens. This is the same issue as above: "SEP" can be matched by both the STRING rule and the SEP rule, but since STRING is defined before SEP, STRING "wins".
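
The two points above can be reproduced with a small Python simulation. This is only a sketch, not ANTLR itself: `tokenize` and `ORIGINAL_RULES` are hypothetical names, regular expressions stand in for the lexer rules (with DIGIT already corrected to [0-9]), and the rules are listed in the order used by the grammar in the question.

```python
import re

MONTHS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]

# Lexer rules in the order used by the grammar in the question.
ORIGINAL_RULES = [
    ("STRING", r"[a-zA-Z][a-zA-Z]+"),
    ("NUMBER", r"[0-9]+"),
    ("INT4", r"[0-9]{4}"),
    ("INT", r"[0-9]+"),
    ("DIGIT", r"[0-9]"),
    ("DQUOTE", r'"'),
] + [
    # JAN becomes [Jj][Aa][Nn], and so on for the other months.
    (m, "".join(f"[{c}{c.lower()}]" for c in m)) for m in MONTHS
] + [
    ("SEPARATOR", r"-"),
    ("WS", r"[ \n\t\r]+"),
]


def tokenize(rules, text):
    """Longest match wins; on a tie, the rule listed first wins."""
    compiled = [(name, re.compile(pattern)) for name, pattern in rules]
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, regex in compiled:
            m = regex.match(text, pos)
            # Strict '>' keeps the earlier rule when two matches are equally long.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        if best_name != "WS":  # mirrors 'WS -> skip'
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


tokens = tokenize(ORIGINAL_RULES,
                  "new teatime at 23-SEP-2013 for Santosh Singh and 3 guests")
for name, value in tokens:
    print(f"{name:<25} `{value}`")
```

With this rule order, the simulation should produce only STRING, NUMBER and SEPARATOR tokens, in line with the dump above: NUMBER shadows INT4, INT and DIGIT, and STRING shadows SEP.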

Reordering the grammar a bit like this:

grammar T;

parse
 : (datetime | text)+ EOF
 ;

text
 : STRING
 | month
 | INT
 ;

datetime
 : INT SEPARATOR month SEPARATOR INT4
 ;

month
 : JAN
 | FEB
 | MAR
 | APR
 | MAY
 | JUN
 | JUL
 | AUG
 | SEP
 | OCT
 | NOV
 | DEC
 ;

JAN : [Jj][Aa][Nn];
FEB : [Ff][Ee][Bb];
MAR : [Mm][Aa][Rr];
APR : [Aa][Pp][Rr];
MAY : [Mm][Aa][Yy];
JUN : [Jj][Uu][Nn];
JUL : [Jj][Uu][Ll];
AUG : [Aa][Uu][Gg];
SEP : [Ss][Ee][Pp];
OCT : [Oo][Cc][Tt];
NOV : [Nn][Oo][Vv];
DEC : [Dd][Ee][Cc];

STRING    : [a-zA-Z][a-zA-Z]+;
INT4      : DIGIT DIGIT DIGIT DIGIT;
INT       : DIGIT+;
DQUOTE    : '"';
SEPARATOR : '-';

WS: [ \n\t\r]+ -> skip;

fragment DIGIT : [0-9];

should match your input correctly.
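
As a rough sanity check of the reordered lexer rules, here is a minimal Python sketch that approximates them with regular expressions. It is not ANTLR and it only models the lexer, not the parser rules; `REORDERED_RULES` and `tokenize` are hypothetical names, and DIGIT is left out because a fragment never produces a token of its own.

```python
import re

MONTHS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]

# Month rules come first, then STRING; INT4 comes before INT.
REORDERED_RULES = [
    (m, "".join(f"[{c}{c.lower()}]" for c in m)) for m in MONTHS
] + [
    ("STRING", r"[a-zA-Z][a-zA-Z]+"),
    ("INT4", r"[0-9]{4}"),
    ("INT", r"[0-9]+"),
    ("DQUOTE", r'"'),
    ("SEPARATOR", r"-"),
    ("WS", r"[ \n\t\r]+"),
]


def tokenize(text):
    tokens = []
    while text:
        matches = [(name, m.group()) for name, pattern in REORDERED_RULES
                   if (m := re.match(pattern, text))]
        if not matches:
            raise ValueError(f"no rule matches {text!r}")
        # max() returns the first of several equally long matches,
        # which gives "longest match, then first rule listed".
        name, value = max(matches, key=lambda item: len(item[1]))
        if name != "WS":  # mirrors 'WS -> skip'
            tokens.append((name, value))
        text = text[len(value):]
    return tokens


for name, value in tokenize("at 23-SEP-2013 for Santosh Singh and 3 guests"):
    print(f"{name:<25} `{value}`")
```

In this sketch the date is tokenized as INT, SEPARATOR, SEP, SEPARATOR, INT4, which is the sequence the datetime rule expects. A longer word such as "September" still becomes a STRING, because the longest match is preferred before rule order is considered.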

Upvotes: 2
