zanmato

Reputation: 142

ANTLR lexer for C-style comment

I'm currently working on an ANTLR lexer rule to support C-style comments. There is a widely recommended rule for this:

C_COMMENT
 : '/*' (options {greedy=false;}: .)* '*/'
   { $channel=HIDDEN; }
 ;

However, what I want is a variant: '+' must not be the first non-space character of the comment body, e.g. /* +blablabla*/ is not a valid comment. So I tried something like this:

C_COMMENT
 : '/*' (' '|'\r'|'\t'|'\n')* ~(' '|'\r'|'\t'|'\n'|'+') (options {greedy=false;}: .)* '*/'
   { $channel=HIDDEN; }
 ;

And it nearly worked, except for empty comments /* */. So I tried something like this:

C_COMMENT
 : '/*' (' '|'\r'|'\t'|'\n')*
   ( '*/'
   | (~(' '|'\r'|'\t'|'\n'|'+') (options {greedy=false;}: .)* '*/')
   )
   { $channel=HIDDEN; }
 ;

Neither this rule nor a bunch of similar ones that I didn't list ever worked. The */ in /* */ always falls into the ~(' '|'\r'|'\t'|'\n'|'+') part.

Finally I got something working like this:

C_COMMENT
 : '/*' (' '|'\r'|'\t'|'\n')* '*/'
   { $channel=HIDDEN; }
 | '/*' (' '|'\r'|'\t'|'\n')*
   ( '*/'
   | (~(' '|'\r'|'\t'|'\n'|'+') (options {greedy=false;}: .)* '*/')
   )
   { $channel=HIDDEN; }
 ;

However, ANTLR warns that input like /* */ can match both alternatives.

Could anyone help me understand all of this? Specifically, why did none of the rules before the last one work?

Thanks in advance.

Upvotes: 2

Views: 974

Answers (1)

Bart Kiers

Reputation: 170308

Why not do something like this:

grammar T;

parse
 : ( c_comment
   | plus_comment
   )* 
   EOF
 ;

c_comment
 : C_COMMENT
 ;

plus_comment
 : PLUS_COMMENT
 ;

PLUS_COMMENT
 : '/*' S* '+' .* '*/'
 ;

C_COMMENT
 : '/*' .* '*/'
 ;

SPACES
 : S+ {skip();}
 ;

fragment S
 : ' ' | '\t' | '\r' | '\n'
 ;

which will parse the input:

/**/
/*       + as*/
/*  sdcdcds      sdcds */

as follows:

[image: the resulting parse tree]

The trick here is to define PLUS_COMMENT before C_COMMENT. That way, if the lexer stumbles on "/* s", it falls back from a PLUS_COMMENT to a C_COMMENT because it cannot match the +.
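
The "more specific token first, general token as fallback" ordering can also be sketched outside ANTLR. The following is only an analogy built on Python's re module, not a description of how ANTLR's lexer works internally; the names TOKEN_PATTERNS and tokenize are made up for this illustration.

```python
import re

# Whitespace class mirroring the S fragment in the grammar above.
S = r"[ \t\r\n]"

# Order matters: the more specific PLUS_COMMENT pattern is tried first,
# and C_COMMENT only gets a chance when PLUS_COMMENT fails to match.
# (Hypothetical table for illustration, not generated by ANTLR.)
TOKEN_PATTERNS = [
    ("PLUS_COMMENT", re.compile(r"/\*" + S + r"*\+.*?\*/", re.DOTALL)),
    ("C_COMMENT", re.compile(r"/\*.*?\*/", re.DOTALL)),
    ("SPACES", re.compile(S + r"+")),
]


def tokenize(text):
    """Return (token_name, lexeme) pairs, skipping whitespace between comments."""
    tokens = []
    pos = 0
    while pos < len(text):
        for name, pattern in TOKEN_PATTERNS:
            match = pattern.match(text, pos)
            if match:
                if name != "SPACES":  # SPACES is skipped, as in the grammar
                    tokens.append((name, match.group()))
                pos = match.end()
                break
        else:
            # No pattern matched at this offset.
            raise ValueError(f"unexpected character at offset {pos}: {text[pos]!r}")
    return tokens


if __name__ == "__main__":
    sample = "/**/\n/*       + as*/\n/*  sdcdcds      sdcds */\n"
    for name, lexeme in tokenize(sample):
        print(name, repr(lexeme))
```

Run on the three sample comments from above, the sketch labels the second one PLUS_COMMENT and the other two C_COMMENT, so a parser (or later check) can reject or special-case the '+' form instead of trying to exclude it inside a single lexer rule.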

Upvotes: 2
