Reputation: 9015
I need to tokenize everything that is "outside" any comment, until end of line. For instance:
take me */ and me /* but not me! */ I'm in! // I'm not...
Tokenized as (STR
is the "outside" string, BC
is block-comment and LC
is single-line-comment):
{
STR: "take me */ and me ", // note the "*/" in the string!
BC : " but not me! ",
STR: " I'm in! ",
LC : " I'm not..."
}
And:
/* starting with don't take me */ ...take me...
Tokenized as:
{
BC : " starting with don't take me ",
STR: " ...take me..."
}
The problem is that STR
can be anything except the comments, and since the comments openers are not single char tokens I can't use a negation rule for STR
.
I thought maybe to do something like:
STR : { IsNextSequenceTerminatesThe_STR_rule(); }?;
But I don't know how to look-ahead for characters in lexer actions.
Is it even possible to accomplish with the ANTLR4 lexer, if yes then how?
Upvotes: 1
Views: 207
Reputation: 151
Yes, it is possible to perform the tokenization you are attempting.
Based on what has been described above, you want nested comments. These can be achieved in the lexer only without Action, Predicate nor any code. In order to have nested comments, its easier if you do not use the greedy/non-greedy ANTLR options. You will need to specify/code this into the lexer grammar. Below are the three lexer rules you will need... with STR definition.
I added a parser rule for testing. I've not tested this, but it should do everything you mentioned. Also, its not limited to 'end of line' you can make that modification if you need to.
/*
All 3 COMMENTS are Mutually Exclusive
*/
DOC_COMMENT
: '/**'
( [*]* ~[*/] // Cannot START/END Comment
( DOC_COMMENT
| BLK_COMMENT
| INL_COMMENT
| .
)*?
)?
'*'+ '/' -> channel( DOC_COMMENT )
;
BLK_COMMENT
: '/*'
(
( /* Must never match an '*' in position 3 here, otherwise
there is a conflict with the definition of DOC_COMMENT
*/
[/]? ~[*/] // No START/END Comment
| DOC_COMMENT
| BLK_COMMENT
| INL_COMMENT
)
( DOC_COMMENT
| BLK_COMMENT
| INL_COMMENT
| .
)*?
)?
'*/' -> channel( BLK_COMMENT )
;
INL_COMMENT
: '//'
( ~[\n\r*/] // No NEW_LINE
| INL_COMMENT // Nested Inline Comment
)* -> channel( INL_COMMENT )
;
STR // Consume everthing up to the start of a COMMENT
: ( ~'/' // Any Char not used to START a Comment
| '/' ~[*/] // Cannot START a Comment
)+
;
start
: DOC_COMMENT
| BLK_COMMENT
| INL_COMMENT
| STR
;
Upvotes: 2
Reputation: 170158
Try something like this:
grammar T;
@lexer::members {
// Returns true iff either "//" or "/*" is ahead in the char stream.
boolean startCommentAhead() {
return _input.LA(1) == '/' && (_input.LA(2) == '/' || _input.LA(2) == '*');
}
}
// other rules
STR
: ( {!startCommentAhead()}? . )+
;
Upvotes: 1