Roger Costello

Reputation: 3209

ANTLR 4 token rule that matches any characters until it encounters XYZ

I want a token rule that gobbles up all characters until it gets to the characters XYZ.

Thus, if the input is this:

helloXYZ

then the token rule should return this token:

hello

If the input is this:

Blah Blah XYZ

then the token rule should return this token:

Blah Blah

How do I define a token rule to do this?

Upvotes: 2

Views: 1965

Answers (3)

Sam Harwell

Reputation: 99999

If you want good performance, you need to use a form which does not use predicates. I would use code modeled after PositionAdjustingLexer.g4 to reset the position if the token ends with XYZ.

Edit: Don't underestimate the performance hit of the answer using a semantic predicate. The predicate will be evaluated at least once for every character of your entire input stream, and any character where a predicate is evaluated is prevented from using the DFA. The last time I saw something like this in use, it was responsible for more than 95% of the execution time of the entire parsing process, and removing it improved performance from more than 20 seconds to less than 1 second.

tokens {
  SpecialToken
}

mode SpecialTokenMode;

  // In your position adjusting lexer, if you see a token with the type
  // SpecialTokenWithXYZ, reset the position to remove the last 3 characters and set
  // the type to SpecialToken
  SpecialTokenWithXYZ
    : 'XYZ'
      -> popMode
    ;

  SpecialTokenCharacterAtEOF
    : . EOF
      -> type(SpecialToken), popMode
    ;

  SpecialTokenCharacter
    : .
      -> more
    ;

If you want even better performance, you can add a couple of rules to optimize the handling of sequences that do not contain any X characters:

tokens {
  SpecialToken
}

mode SpecialTokenMode;

  // In your position adjusting lexer, if you see a token with the type
  // SpecialTokenWithXYZ, reset the position to remove the last 3 characters and set
  // the type to SpecialToken
  SpecialTokenWithXYZ
    : 'XYZ'
      -> popMode
    ;

  SpecialTokenCharacterSpanAtEOF
    : ~'X'+ EOF
      -> type(SpecialToken), popMode
    ;

  SpecialTokenCharacterSpan
    : ~'X'+
      -> more
    ;

  SpecialTokenXAtEOF
    : 'X' EOF
      -> type(SpecialToken), popMode
    ;

  SpecialTokenX
    : 'X'
      -> more
    ;
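The adjustment described in the grammar comments (drop the trailing XYZ, then resume lexing at the X) can be sketched in plain Java. This is only a stand-in for the idea, not ANTLR API: the class PositionAdjustSketch, the Adjusted holder, and the adjust method are hypothetical names, and in a real lexer the logic would live inside the lexer itself, modeled on PositionAdjustingLexer.g4, which is not reproduced here.

```java
/** Plain-Java stand-in for the position adjustment; hypothetical names, no ANTLR API. */
final class PositionAdjustSketch {
    static final String TERMINATOR = "XYZ";

    /** Adjusted token text plus the input index where lexing should resume. */
    static final class Adjusted {
        final String text;
        final int resumeIndex;

        Adjusted(String text, int resumeIndex) {
            this.text = text;
            this.resumeIndex = resumeIndex;
        }
    }

    /**
     * matchedText is assumed to be everything the mode accumulated (via "more")
     * starting at startIndex. If it ends with XYZ, the last 3 characters are
     * removed and the resume position is moved back by 3, so that XYZ is
     * lexed again by whatever rule comes next.
     */
    static Adjusted adjust(String matchedText, int startIndex) {
        int end = startIndex + matchedText.length();
        if (matchedText.endsWith(TERMINATOR)) {
            int keep = matchedText.length() - TERMINATOR.length();
            return new Adjusted(matchedText.substring(0, keep), end - TERMINATOR.length());
        }
        // Reached EOF without seeing XYZ: the token is kept unchanged.
        return new Adjusted(matchedText, end);
    }
}
```

With this sketch, adjust("helloXYZ", 0) yields the text hello and a resume index of 5, which is the position of the X.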

Upvotes: 2

Terence Parr

Reputation: 5962

How about this?

HELLO : 'hello' {_input.LA(1)!=' '}? ;

Upvotes: 1

james.garriss

Reputation: 13416

Using the hint that Terence gives in his answer, I think this is what Roger is looking for:

grammar UseLookahead;

parserRule : LexerRule;

LexerRule : .+? { (_input.LA(1) == 'X') &&
                  (_input.LA(2) == 'Y') &&
                  (_input.LA(3) == 'Z') 
                }?
          ;

This gives the answers required, hello and Blah Blah respectively. I confess that I don't understand the significance of the final ?.
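As a rough illustration of what the rule checks, here is a plain-Java simulation of "consume at least one character, then stop as soon as the next three characters are X, Y, Z". It is a sketch under assumptions, not how the ANTLR runtime actually executes the rule; LookaheadPredicateSketch, xyzAhead, matchUntilXyz, and the predicateChecks counter are hypothetical names used only for this example.

```java
/** Simulates the non-greedy match plus lookahead check on a plain String. */
final class LookaheadPredicateSketch {
    /** Number of lookahead checks made by the most recent matchUntilXyz call. */
    static int predicateChecks;

    /** Mirrors the three LA(1), LA(2), LA(3) comparisons in the grammar. */
    static boolean xyzAhead(String input, int pos) {
        predicateChecks++;
        return input.startsWith("XYZ", pos);
    }

    /**
     * Returns the shortest non-empty prefix that is immediately followed by
     * XYZ, or null if there is none.
     */
    static String matchUntilXyz(String input) {
        predicateChecks = 0;
        // Start at 1 because .+? must consume at least one character.
        for (int end = 1; end <= input.length(); end++) {
            if (xyzAhead(input, end)) {
                return input.substring(0, end);
            }
        }
        return null;
    }
}
```

In this simulation the check runs once per consumed character, and any whitespace directly before XYZ stays in the result, so trim it afterwards if it is not wanted.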

Upvotes: 2
