Reputation: 12102
For a markup language I'm trying to parse, I decided to give parser generation a try with ANTLR. I'm new to the field, and I'm messing something up.
My grammar is
grammar Test;
DIGIT : ('0'..'9');
LETTER : ('A'..'Z');
SLASH : '/';
restriction
: ('E' ap)
| ('L' ap)
| 'N';
ap : LETTER LETTER LETTER;
car : LETTER LETTER;
fnum : DIGIT DIGIT DIGIT DIGIT? LETTER?;
flt : car fnum?;
message : 'A' (SLASH flt)? (SLASH restriction)?;
which does exactly what I want when I give it the input string A/KK543/EPOS. When I give it A/KL543/EPOS, however, it fails with MismatchedTokenException(9!=5). It seems like some sort of conflict: it wants to start a restriction at the first L, so I must be doing something wrong in the grammar definition, but I can't work out what.
Upvotes: 1
Views: 55
Reputation: 170257
For the input "A/KK543/EPOS", the following tokens are created:
'A' 'A' SLASH '/' LETTER 'K' LETTER 'K' DIGIT '5' DIGIT '4' DIGIT '3' SLASH '/' 'E' 'E' LETTER 'P' LETTER 'O' LETTER 'S'
But for the input "A/KL543/EPOS", these are created:
'A' 'A' SLASH '/' LETTER 'K' 'L' 'L' DIGIT '5' DIGIT '4' DIGIT '3' SLASH '/' 'E' 'E' LETTER 'P' LETTER 'O' LETTER 'S'
As you can see, the char 'L' does not get tokenized as a LETTER. For the literal tokens 'A', 'E', 'L' and 'N' inside your parser rules, ANTLR automatically creates separate lexer rules that are placed before all other lexer rules. This causes your lexer to look like this behind the scenes:
A : 'A';
E : 'E';
L : 'L';
N : 'N';
DIGIT : '0'..'9';
LETTER : 'A'..'Z';
SLASH : '/';
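The effect of that rule order can be sketched with a small Python model. This is not ANTLR itself, only an illustration under two assumptions: every token is a single character, and when several rules match, the rule listed first wins. The names LEXER_RULES and tokenize are made up for this sketch.

```python
# Toy model of the implicit lexer shown above (an illustration, not ANTLR).
# Assumption: rules are tried top to bottom and the first match wins.
LEXER_RULES = [
    ("'A'", lambda c: c == "A"),
    ("'E'", lambda c: c == "E"),
    ("'L'", lambda c: c == "L"),
    ("'N'", lambda c: c == "N"),
    ("DIGIT", lambda c: "0" <= c <= "9"),
    ("LETTER", lambda c: "A" <= c <= "Z"),
    ("SLASH", lambda c: c == "/"),
]


def tokenize(text):
    """Return a list of (token_type, char) pairs for the given input."""
    tokens = []
    for ch in text:
        for name, matches in LEXER_RULES:
            if matches(ch):
                tokens.append((name, ch))
                break
        else:
            raise ValueError(f"no lexer rule matches {ch!r}")
    return tokens


if __name__ == "__main__":
    for sample in ("A/KK543/EPOS", "A/KL543/EPOS"):
        print(sample, "->", " ".join(name for name, _ in tokenize(sample)))
```

In this model the second character of the carrier code in "A/KL543/EPOS" comes out as the literal token 'L' rather than LETTER, which is the difference between the two token streams listed earlier.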
Therefore, any single 'A', 'E', 'L' or 'N' will never become a LETTER token. This is simply how ANTLR works. If you want to match them as letters, you'll need to create a parser rule letter and let it match these tokens too. Something like this:
message
: A (SLASH flt)? (SLASH restriction)?
;
flt
: car fnum?
;
fnum
: DIGIT DIGIT DIGIT DIGIT? letter?
;
restriction
: E ap
| L ap
| N
;
ap
: letter letter letter
;
car
: letter letter
;
letter
: A
| E
| L
| N
| LETTER
;
A : 'A';
E : 'E';
L : 'L';
N : 'N';
DIGIT : '0'..'9';
LETTER : 'A'..'Z';
SLASH : '/';
which will parse the input "A/KL543/EPOS" like this:
[parse tree image not available]
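As a rough cross-check of the two grammars, the parser rules can be written as a regular expression over token types, since none of the rules is recursive. This is only a sketch in Python, not what ANTLR generates: it tests whether a token sequence belongs to the language and says nothing about ANTLR's prediction algorithm or error messages. The helper names (token_types, message_pattern, accepts) and the one-character token codes are assumptions made for the example.

```python
import re


def token_types(text):
    """Encode each token as one character: the literals A, E, L, N keep
    their own letter, 'x' stands for LETTER, 'd' for DIGIT, '/' for SLASH."""
    out = []
    for ch in text:
        if ch in "AELN":
            out.append(ch)
        elif "0" <= ch <= "9":
            out.append("d")
        elif "A" <= ch <= "Z":
            out.append("x")
        elif ch == "/":
            out.append("/")
        else:
            raise ValueError(f"no lexer rule matches {ch!r}")
    return "".join(out)


def message_pattern(letter):
    """Build the 'message' rule as a regex, given what counts as a letter."""
    ap = letter * 3                      # ap : letter letter letter
    car = letter * 2                     # car : letter letter
    fnum = "ddd" + "d?" + letter + "?"   # fnum : DIGIT DIGIT DIGIT DIGIT? letter?
    flt = car + "(?:" + fnum + ")?"      # flt : car fnum?
    restriction = "(?:E" + ap + "|L" + ap + "|N)"
    return re.compile("A(?:/" + flt + ")?(?:/" + restriction + ")?")


ORIGINAL = message_pattern("x")        # only LETTER tokens count as letters
REVISED = message_pattern("[AELNx]")   # the 'letter' parser rule from above


def accepts(pattern, text):
    return pattern.fullmatch(token_types(text)) is not None


if __name__ == "__main__":
    for sample in ("A/KK543/EPOS", "A/KL543/EPOS"):
        print(sample, accepts(ORIGINAL, sample), accepts(REVISED, sample))
```

Under these assumptions the original rules accept "A/KK543/EPOS" but reject "A/KL543/EPOS", while the version with the letter rule accepts both.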
Upvotes: 3