Markup parser failing

Question

For a markup language I'm trying to parse, I decided to give parser generation a try with ANTLR. I'm new to the field, and I'm messing something up.

My grammar is

grammar Test;
DIGIT   :   ('0'..'9');
LETTER  :   ('A'..'Z');
SLASH   :   '/'; 
restriction
    :   ('E' ap)
    |   ('L' ap)
    |   'N';
ap  :   LETTER LETTER LETTER;
car :   LETTER LETTER;
fnum    :   DIGIT DIGIT DIGIT DIGIT? LETTER?;
flt :   car fnum?;
message :   'A' (SLASH flt)? (SLASH restriction)?;

This grammar does exactly what I want when I give it the input string A/KK543/EPOS. When I give it A/KL543/EPOS, however, it fails with MismatchedTokenException(9!=5). It looks like some sort of conflict: the parser seems to want to start a restriction at the first L, so I must be doing something wrong in the language definition, but I can't work out what.

Bart Kiers · Accepted Answer

For the input "A/KK543/EPOS", the following tokens are created:

'A'        'A'
SLASH      '/'
LETTER     'K'
LETTER     'K'
DIGIT      '5'
DIGIT      '4'
DIGIT      '3'
SLASH      '/'
'E'        'E'
LETTER     'P'
LETTER     'O'
LETTER     'S'

But for the input "A/KL543/EPOS", these are created:

'A'        'A'
SLASH      '/'
LETTER     'K'
'L'        'L'
DIGIT      '5'
DIGIT      '4'
DIGIT      '3'
SLASH      '/'
'E'        'E'
LETTER     'P'
LETTER     'O'
LETTER     'S'

As you can see, the char 'L' does not get tokenized as a LETTER. For the literal tokens 'A', 'E', 'L' and 'N' inside your parser rules, ANTLR automatically creates separate lexer rules that are placed before all other lexer rules. This causes your lexer to look like this behind the scenes:

A      : 'A';
E      : 'E';
L      : 'L';
N      : 'N';
DIGIT  : '0'..'9';
LETTER : 'A'..'Z';
SLASH  : '/';
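The effect of that rule order can be reproduced with a small Python sketch. This is a simplified model, not ANTLR's actual lexer: it assumes every token is a single character and that the first listed rule matching a character wins. The names LEXER_RULES and tokenize are made up for this illustration.

```python
# Simplified, hypothetical model of the lexer: rules are tried in order and
# the first one that matches the current character wins. Every token in this
# grammar is exactly one character long.
LEXER_RULES = [
    ("A", lambda c: c == "A"),
    ("E", lambda c: c == "E"),
    ("L", lambda c: c == "L"),
    ("N", lambda c: c == "N"),
    ("DIGIT", lambda c: "0" <= c <= "9"),
    ("LETTER", lambda c: "A" <= c <= "Z"),
    ("SLASH", lambda c: c == "/"),
]


def tokenize(text):
    """Return a list of (token_type, char) pairs for the given input."""
    tokens = []
    for ch in text:
        for name, matches in LEXER_RULES:
            if matches(ch):
                tokens.append((name, ch))
                break
        else:
            # No rule matched this character.
            raise ValueError(f"no lexer rule matches {ch!r}")
    return tokens


if __name__ == "__main__":
    # Print one token per line, type first, then the matched character.
    for name, ch in tokenize("A/KL543/EPOS"):
        print(f"{name:<8} {ch!r}")
```

Because the literal rules come first, the LETTER rule is never reached for 'A', 'E', 'L' or 'N', which matches the two token listings shown earlier.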

Therefore, any single 'A', 'E', 'L' or 'N' will never become a LETTER token. That is why car, which expects LETTER LETTER, fails on "KL": the second token is an L, not a LETTER. This is simply how ANTLR works. If you want to match them as letters, you'll need to create a parser rule letter and let it match these tokens too. Something like this:

message
 : A (SLASH flt)? (SLASH restriction)?
 ;

flt
 : car fnum?
 ;

fnum
 : DIGIT DIGIT DIGIT DIGIT? letter?
 ;

restriction
 : E ap
 | L ap
 | N
 ;

ap
 : letter letter letter
 ;

car
 : letter letter
 ;

letter
 : A
 | E
 | L
 | N
 | LETTER
 ;

A      : 'A';
E      : 'E';
L      : 'L';
N      : 'N';
DIGIT  : '0'..'9';
LETTER : 'A'..'Z';
SLASH  : '/';

which will parse the input "A/KL543/EPOS" like this:

[Image: parse tree for the input "A/KL543/EPOS"]
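To check the idea without running ANTLR, here is a minimal hand-written recognizer in Python. It is only a sketch: it uses simple backtracking rather than the parser ANTLR would generate, and the names accepts, ORIGINAL_LETTER and FIXED_LETTER are invented for this example. The letter_kinds argument switches between the original grammar (only LETTER counts as a letter) and the rewritten one (the letter rule also accepts A, E, L and N).

```python
LITERALS = {"A", "E", "L", "N"}
ORIGINAL_LETTER = {"LETTER"}          # original grammar: only LETTER tokens
FIXED_LETTER = LITERALS | {"LETTER"}  # rewritten grammar: the `letter` rule


def tokenize(text):
    """Map each character to a token type; literal tokens take priority."""
    tokens = []
    for ch in text:
        if ch in LITERALS:
            tokens.append(ch)
        elif "A" <= ch <= "Z":
            tokens.append("LETTER")
        elif "0" <= ch <= "9":
            tokens.append("DIGIT")
        elif ch == "/":
            tokens.append("SLASH")
        else:
            raise ValueError(f"unexpected character {ch!r}")
    return tokens


def accepts(text, letter_kinds=FIXED_LETTER):
    """Return True if the whole input matches the `message` rule."""
    tokens = tokenize(text)

    def expect(pos, kinds):
        # Position after one token of the given kinds, or None on mismatch.
        if pos is not None and pos < len(tokens) and tokens[pos] in kinds:
            return pos + 1
        return None

    def letters(pos, count):
        for _ in range(count):
            pos = expect(pos, letter_kinds)
        return pos

    def fnum(pos):
        # DIGIT DIGIT DIGIT DIGIT? letter?
        for _ in range(3):
            pos = expect(pos, {"DIGIT"})
        if pos is None:
            return None
        pos = expect(pos, {"DIGIT"}) or pos
        return expect(pos, letter_kinds) or pos

    def flt(pos):
        # car fnum?   where car is two letters
        pos = letters(pos, 2)
        if pos is None:
            return None
        return fnum(pos) or pos

    def restriction(pos):
        # E ap | L ap | N   where ap is three letters
        after = expect(pos, {"E", "L"})
        if after is not None:
            return letters(after, 3)
        return expect(pos, {"N"})

    # message : A (SLASH flt)? (SLASH restriction)?
    start = expect(0, {"A"})
    if start is None:
        return False
    candidates = [start]  # the optional flt part may be absent
    after_flt = flt(expect(start, {"SLASH"}))
    if after_flt is not None:
        candidates.append(after_flt)
    for pos in candidates:
        if pos == len(tokens):
            return True
        if restriction(expect(pos, {"SLASH"})) == len(tokens):
            return True
    return False


if __name__ == "__main__":
    for kinds, label in ((ORIGINAL_LETTER, "original"), (FIXED_LETTER, "fixed")):
        for text in ("A/KK543/EPOS", "A/KL543/EPOS"):
            print(label, text, accepts(text, kinds))
```

Under these assumptions, the original letter set rejects "A/KL543/EPOS" because the L token is not a LETTER, while the rewritten letter set accepts both inputs.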
