Reputation: 64205
I am trying to match the text below with an ANTLR grammar:
# a
# b
# c
The ANTLR grammar is:
grammar header;
start : commentBlock EOF;
commentBlock : CommentLine+;
CommentLine : '#' AsciiChars+;
AsciiChars : [a-zA-Z];
fragment CR : '\r';
fragment LF : '\n';
EOL : CR?LF ->skip;
fragment Tab : '\t';
fragment Space : ' ';
TS : (Tab|Space)+ ->skip;
The error I got is:
line 1:0 token recognition error at: '# '
line 2:0 token recognition error at: '# '
line 3:0 token recognition error at: '# '
[@0,2:2='a',<AsciiChars>,1:2]
[@1,7:7='b',<AsciiChars>,2:2]
[@2,12:12='c',<AsciiChars>,3:2]
[@3,15:14='<EOF>',<EOF>,4:0]
line 1:2 mismatched input 'a' expecting CommentLine
The grammar looks reasonable to me, so why does this error occur?
Strangely, after I changed the lexer rule CommentLine into a parser rule commentLine, it works:
grammar header;
start : commentBlock EOF;
commentBlock : commentLine+;
commentLine : '#' AsciiChars+; // <=== here CommentLine -> commentLine
AsciiChars : [a-zA-Z];
fragment CR : '\r';
fragment LF : '\n';
EOL : CR?LF ->skip;
fragment Tab : '\t';
fragment Space : ' ';
TS : (Tab|Space)+ ->skip;
But what I actually want is to discard all the comment lines. If commentLine has to be a parser rule, I cannot use -> skip to discard it.
I think I can explain it now.
The critical things to remember are:
Let me explain it with a concise example:
The document to match:
# abc
Grammar 1:
grammar test;
t : T2;
p : t EOF;
Char : [a-z];
T2 : '#' T1+ Char+; // <<<< Here T2 references the skipped rule T1.
fragment Tab : '\t';
fragment Space : ' ';
T1 : (Tab|Space)+ ->skip; //<<<<< T1 is to be skipped.
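In Grammar 1, T1 is marked to be skipped, but the T1 referenced inside T2 is not skipped: -> skip only applies when T1 is matched as a token in its own right, not when it is used as part of another lexer rule. So T2 matches the whole input text in the lexer phase. (Even if we put T2 after T1, T2 still matches, because the ANTLR lexer prefers the rule that matches the longest run of input; rule order only breaks ties.)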
Grammar 2:
The skipped T1 is not referenced by another token rule, but directly in a parser rule.
grammar test;
t : '#' T1+ Char+; // <<<<<<<<<<<< HERE
p : t EOF;
Char : [a-z];
fragment Tab : '\t';
fragment Space : ' ';
T1 : (Tab|Space)+ ->skip; //<<<<< T1 is to be skipped.
This time there is no T2 rule to help the spaces survive the lexer phase, so every T1 in the input is skipped. When the parser runs afterwards, matching fails with this error:
[@0,0:0='#',<'#'>,1:0]
[@1,4:4='a',<Char>,1:4]
[@2,5:5='b',<Char>,1:5]
[@3,6:6='c',<Char>,1:6]
[@4,7:6='<EOF>',<EOF>,1:7]
line 1:4 mismatched input 'a' expecting T1
That is because all T1 tokens were already discarded in the lexer phase.
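The difference between the two grammars can be modelled outside ANTLR. The following is only a rough Python sketch, not ANTLR itself: the function tokenize and the lists grammar1 and grammar2 are hypothetical names, and the regular expressions are hand-written approximations of the lexer rules above. It assumes just two things about the lexer: the longest match wins (the earlier rule wins a tie), and a skipped rule consumes input without emitting a token.

```python
import re


def tokenize(text, rules):
    """Toy longest-match lexer. rules: list of (name, regex, skip) in grammar order."""
    tokens = []
    pos = 0
    while pos < len(text):
        best = None  # (length, name, skip) of the best candidate so far
        for name, pattern, skip in rules:
            m = re.compile(pattern).match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so the rule listed first keeps a tie.
            if m and m.end() > pos and (best is None or m.end() - pos > best[0]):
                best = (m.end() - pos, name, skip)
        if best is None:
            raise ValueError(f"token recognition error at: {text[pos]!r}")
        length, name, skip = best
        if not skip:
            tokens.append((name, text[pos:pos + length]))
        pos += length  # skipped input is consumed but never emitted
    return tokens


# Grammar 1: T2 inlines the whitespace pattern of T1, so it matches "# abc" whole.
grammar1 = [
    ("Char", r"[a-z]", False),
    ("T2", r"#[\t ]+[a-z]+", False),
    ("T1", r"[\t ]+", True),
]

# Grammar 2: '#' is only an implicit literal token; whitespace is always skipped.
grammar2 = [
    ("'#'", r"#", False),
    ("Char", r"[a-z]", False),
    ("T1", r"[\t ]+", True),
]

print(tokenize("# abc", grammar1))
print(tokenize("# abc", grammar2))
```

With Grammar 1 the model emits a single T2 token that still contains the space; with Grammar 2 it emits '#' followed by three Char tokens and no T1 at all, which is why a parser rule that asks for T1 can never succeed.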
Back to my original question: the subtle mistake I made was assuming that once TS is skipped, the remaining characters would be regrouped into the token CommentLine, which contains no spaces. That is plain wrong in ANTLR.
The lexer phase runs entirely before the parser phase, and CommentLine is a token rule with no spaces in it, so it never matches the input, where a space follows each '#'.
So, just as @macmoonshine said, I do have to add TS to the CommentLine token.
Upvotes: 1
Views: 2115
Reputation: 151
Try the rule below. Your comment appears to be the same as a normal single-line comment with '#' in place of '//'. If you require a space after the hash, use '# '. If you require the hash to be in column 1, use [\n\r] '# ' ~[\n\r]*. Judging from the example, this should cover all the potential options.
COMMENT_LINE
: '#' ~[\n\r]* -> skip
;
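To see which characters such a rule would consume, here is a small Python model, again only a sketch and not ANTLR: the regular expression #[^\r\n]* is assumed to be a fair equivalent of '#' ~[\n\r]*, and strip_comments is a hypothetical helper name. Unlike a real lexer, it has no notion of other tokens, so it would also remove a '#' that appears inside a string literal.

```python
import re

# '#' followed by any run of characters that are not CR or LF.
COMMENT_LINE = re.compile(r"#[^\r\n]*")


def strip_comments(text):
    """Drop every '#...' comment but keep the line terminators."""
    return COMMENT_LINE.sub("", text)


source = "# a\n# b\nsomething\n# c\n"
print(repr(strip_comments(source)))
```

The line ends are left in place because the rule stops in front of them; in the grammar they are then handled by the separate EOL rule.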
Upvotes: 0
Reputation: 7409
Perhaps you're looking for this:
grammar Header;
start : CommentLine+ EOF;
CommentLine : '#' ' ' AsciiChars+;
AsciiChars : [a-zA-Z];
fragment CR : '\r';
fragment LF : '\n';
EOL : CR?LF ->skip;
fragment Tab : '\t';
fragment Space : ' ';
TS : (Tab|Space)+ ->skip;
That one matches each comment with a single lexer rule. This next one adds -> skip to that rule:
grammar Header;
start : CommentLine+ EOF;
CommentLine : '#' ' ' AsciiChars+ -> skip;
AsciiChars : [a-zA-Z];
fragment CR : '\r';
fragment LF : '\n';
EOL : CR?LF ->skip;
fragment Tab : '\t';
fragment Space : ' ';
TS : (Tab|Space)+ ->skip;
This will ignore the comments entirely, and in fact gives an error as written, because the rule start expects a CommentLine, which is now discarded. So if you want to ignore and discard comments, use something like this second grammar and don't mention CommentLine in your parser rules; just let the lexer strip them. Or, if you want to preserve comments, use the first one.
A final idea is to reroute comments to another channel:
grammar Header;
start : other EOF;
other: AsciiChars;
CommentLine : '#' ' ' AsciiChars+ -> channel(2);
AsciiChars : [a-zA-Z]+;
fragment CR : '\r';
fragment LF : '\n';
EOL : CR?LF ->skip;
fragment Tab : '\t';
fragment Space : ' ';
TS : (Tab|Space)+ ->skip;
In this grammar, comments are still lexed, but they are routed to another channel for possible processing. I also added another rule to start, just so there would be something to match in:
# a
# b
something
# c
[@0,0:2='# a',<CommentLine>,channel=2,1:0]
[@1,5:7='# b',<CommentLine>,channel=2,2:0]
[@2,10:18='something',<AsciiChars>,3:0]
[@3,21:23='# c',<CommentLine>,channel=2,4:0]
[@4,26:25='<EOF>',<EOF>,5:0]
One of these options should surely do it for you ;)
Upvotes: 1
Reputation: 17721
Your grammar does not include spaces in comments, but your comments do.
EDIT: Have you tried commentLine : '#' TS AsciiChars; as the comment rule?
Upvotes: 1