Tar

Reputation: 9015

How to tokenize blocks (comments, strings, ...) as well as inter-blocks (any char outside blocks)?

I need to tokenize everything that is "outside" any comment, until end of line. For instance:

take me */ and me /* but not me! */ I'm in! // I'm not...

Tokenized as (STR is the "outside" string, BC is block-comment and LC is single-line-comment):

{
    STR: "take me */ and me ", // note the "*/" in the string!
    BC : " but not me! ",
    STR: " I'm in! ",
    LC : " I'm not..."
}

And:

/* starting with don't take me */ ...take me...

Tokenized as:

{
    BC : " starting with don't take me ",
    STR: " ...take me..."
}

The problem is that STR can be anything except a comment, and since the comment openers ("/*" and "//") are not single-character tokens, I can't use a simple negation rule for STR.

I thought maybe to do something like:

STR : { IsNextSequenceTerminatesThe_STR_rule(); }?;

But I don't know how to look ahead at upcoming characters in lexer actions.

Is this even possible with the ANTLR4 lexer? If so, how?

Upvotes: 1

Views: 207

Answers (2)

Dennis Ashley

Reputation: 151

Yes, it is possible to perform the tokenization you are attempting.

Based on what has been described above, you want nested comments. These can be handled in the lexer alone, without actions, predicates, or any other embedded code. To support nested comments, it's easier not to rely on ANTLR's greedy/non-greedy options; instead, spell the nesting out in the lexer grammar. Below are the three comment rules you will need, plus a definition of STR.

I added a parser rule for testing. I have not tested this, but it should do everything you mentioned. Also, it is not limited to 'end of line'; you can make that modification if you need to.

/*
    All 3 COMMENTS are Mutually Exclusive
 */
DOC_COMMENT
        : '/**'
          ( [*]* ~[*/]         // Cannot START/END Comment
            ( DOC_COMMENT
            | BLK_COMMENT
            | INL_COMMENT
            | .
            )*?
          )?
          '*'+ '/' -> channel( DOC_COMMENT )
        ;
BLK_COMMENT
        : '/*'
          (
            ( /* Must never match an '*' in position 3 here, otherwise
                 there is a conflict with the definition of DOC_COMMENT
               */
              [/]? ~[*/]       // No START/END Comment
            | DOC_COMMENT
            | BLK_COMMENT
            | INL_COMMENT
            )
            ( DOC_COMMENT
            | BLK_COMMENT
            | INL_COMMENT
            | .
            )*?
          )?
          '*/' -> channel( BLK_COMMENT )
        ;
INL_COMMENT
        : '//'
          ( ~[\n\r*/]          // No NEW_LINE
          | INL_COMMENT        // Nested Inline Comment
          )* -> channel( INL_COMMENT )
        ;
STR       // Consume everything up to the start of a COMMENT
        : ( ~'/'      // Any Char not used to START a Comment
          | '/' ~[*/] // Cannot START a Comment
          )+
        ;

start
        : DOC_COMMENT
        | BLK_COMMENT
        | INL_COMMENT
        | STR
        ;

Upvotes: 2

Bart Kiers

Reputation: 170158

Try something like this:

grammar T;

@lexer::members {

  // Returns true iff either "//" or "/*"  is ahead in the char stream.
  boolean startCommentAhead() {
    return _input.LA(1) == '/' && (_input.LA(2) == '/' || _input.LA(2) == '*');
  }
}

// other rules

STR
 : ( {!startCommentAhead()}? . )+
 ;
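
To see what the predicate does without generating a lexer, the same two-character lookahead can be simulated over a plain String. This is a minimal sketch, not ANTLR code: the class name, the "STR:"/"BC:"/"LC:" labels, and the handling of an unterminated block comment are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class CommentSplitter {

    // Same check as startCommentAhead(), but on a String index
    // instead of _input.LA(1) / _input.LA(2).
    static boolean startCommentAhead(String s, int i) {
        return i + 1 < s.length()
            && s.charAt(i) == '/'
            && (s.charAt(i + 1) == '/' || s.charAt(i + 1) == '*');
    }

    static List<String> tokenize(String s) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            if (s.startsWith("/*", i)) {
                int end = s.indexOf("*/", i + 2);
                // Assumption: an unterminated block comment runs to end of input.
                int bodyEnd = end < 0 ? s.length() : end;
                tokens.add("BC:" + s.substring(i + 2, bodyEnd));
                i = end < 0 ? s.length() : end + 2;
            } else if (s.startsWith("//", i)) {
                int end = i + 2;
                while (end < s.length() && s.charAt(end) != '\n' && s.charAt(end) != '\r') {
                    end++;
                }
                tokens.add("LC:" + s.substring(i + 2, end));
                i = end;
            } else {
                // Mirrors STR : ( {!startCommentAhead()}? . )+
                int start = i;
                do {
                    i++;
                } while (i < s.length() && !startCommentAhead(s, i));
                tokens.add("STR:" + s.substring(start, i));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String input = "take me */ and me /* but not me! */ I'm in! // I'm not...";
        for (String token : tokenize(input)) {
            System.out.println(token);
        }
    }
}
```

For the first example in the question this produces the four pieces STR, BC, STR, LC, with the stray "*/" kept inside the first STR because only "//" and "/*" stop the loop.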

Upvotes: 1
