Reputation: 142
I'm currently working on an ANTLR lexer rule to support C-style comments. A widely recommended rule for this is:
C_COMMENT
:
'/*' (options {greedy=false;}: .)* '*/'
{ $channel=HIDDEN; }
;
However, what I want is a variant: '+' must not be the first non-space character of the comment body, e.g. /* +blablabla*/ is not a valid comment. So I tried something like this:
C_COMMENT
:
'/*' (' '|'\r'|'\t'|'\n')* ~(' '|'\r'|'\t'|'\n'|'+') (options {greedy=false;}: .)* '*/'
{ $channel=HIDDEN; }
;
And it nearly worked, except for empty comments /* */. So I tried something like this:
C_COMMENT
:
'/*' (' '|'\r'|'\t'|'\n')*
(
'*/'
|
(~(' '|'\r'|'\t'|'\n'|'+') (options {greedy=false;}: .)* '*/')
)
{ $channel=HIDDEN; }
;
Neither this nor a number of similar variants that I haven't listed worked: the */ in /* */ is always consumed by the ~(' '|'\r'|'\t'|'\n'|'+') part.
Finally I got something working like this:
C_COMMENT
:
'/*' (' '|'\r'|'\t'|'\n')* '*/'
{ $channel=HIDDEN; }
|
'/*' (' '|'\r'|'\t'|'\n')*
(
'*/'
|
(~(' '|'\r'|'\t'|'\n'|'+') (options {greedy=false;}: .)* '*/')
)
{ $channel=HIDDEN; }
;
However, ANTLR warns that input such as /* */ can match both alternatives.
Could anyone help me understand all of this? Specifically, why did none of the attempts before the last one work?
Thanks in advance.
Upvotes: 2
Views: 974
Reputation: 170308
Why not do something like this:
grammar T;

parse
  : ( c_comment
    | plus_comment
    )*
    EOF
  ;

c_comment
  : C_COMMENT
  ;

plus_comment
  : PLUS_COMMENT
  ;

PLUS_COMMENT
  : '/*' S* '+' .* '*/'
  ;

C_COMMENT
  : '/*' .* '*/'
  ;

SPACES
  : S+ {skip();}
  ;

fragment S
  : ' ' | '\t' | '\r' | '\n'
  ;
which will parse the input:
/**/ /* + as*/ /* sdcdcds sdcds */
into three tokens: a C_COMMENT, a PLUS_COMMENT, and another C_COMMENT.
The trick here is to define PLUS_COMMENT before C_COMMENT. That way, if the lexer stumbles on "/* s", it falls back from a PLUS_COMMENT to a C_COMMENT because it cannot match the +.
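To make the ordering idea concrete outside of ANTLR, here is a rough Python sketch that emulates the two lexer rules with regular expressions and tries the more specific one first. It is only an illustration: the names RULES and tokenize are made up for this example, and regex backtracking is not how ANTLR actually predicts alternatives, so treat it as a model of the intended behaviour rather than of the generated lexer.

```python
import re

# Assumed regex equivalents of the lexer rules in the grammar above.
# DOTALL lets '.' span newlines; '.*?' mirrors the non-greedy '.*' before '*/'.
PLUS_COMMENT = re.compile(r"/\*[ \t\r\n]*\+.*?\*/", re.DOTALL)  # '/*' S* '+' .* '*/'
C_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)                 # '/*' .* '*/'
SPACES = re.compile(r"[ \t\r\n]+")                              # S+ (skipped)

# Order matters: the more specific rule is tried before the general one.
RULES = [("PLUS_COMMENT", PLUS_COMMENT), ("C_COMMENT", C_COMMENT)]


def tokenize(text):
    """Return (rule_name, matched_text) pairs, skipping whitespace between tokens."""
    pos = 0
    tokens = []
    while pos < len(text):
        ws = SPACES.match(text, pos)
        if ws:
            pos = ws.end()
            continue
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            if m:
                tokens.append((name, m.group()))
                pos = m.end()
                break
        else:
            # No rule matched at this position.
            raise ValueError("unexpected input at offset %d" % pos)
    return tokens


if __name__ == "__main__":
    for token in tokenize("/**/ /* + as*/ /* sdcdcds sdcds */"):
        print(token)
```

With the sample input from above, the first and third comments fall through to C_COMMENT because no '+' follows the leading whitespace, while the second one is picked up by PLUS_COMMENT.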
Upvotes: 2