Reputation: 3209
I want a token rule that gobbles up all characters until it gets to the characters XYZ.
Thus, if the input is this:
helloXYZ
then the token rule should return this token:
hello
If the input is this:
Blah Blah XYZ
then the token rule should return this token:
Blah Blah
How do I define a token rule to do this?
Upvotes: 2
Views: 1965
Reputation: 99999
If you want good performance, you need to use a form which does not use predicates. I would use code modeled after PositionAdjustingLexer.g4 to reset the position if the token ends with XYZ.
Edit: Don't underestimate the performance hit of the answer using a semantic predicate. The predicate will be evaluated at least once for every character of your entire input stream, and any character where a predicate is evaluated is prevented from using the DFA. The last time I saw something like this in use, it was responsible for more than 95% of the execution time of the entire parsing process, and removing it improved performance from more than 20 seconds to less than 1 second.
tokens {
SpecialToken
}
mode SpecialTokenMode;
// In your position adjusting lexer, if you see a token with the type
// SpecialTokenWithXYZ, reset the position to remove the last 3 characters and set
// the type to SpecialToken
SpecialTokenWithXYZ
: 'XYZ'
-> popMode
;
SpecialTokenCharacterAtEOF
: . EOF
-> type(SpecialToken), popMode
;
SpecialTokenCharacter
: .
-> more
;
If you want even better performance, you can add a couple of rules to optimize handling of sequences that do not contain any X characters:
tokens {
SpecialToken
}
mode SpecialTokenMode;
// In your position adjusting lexer, if you see a token with the type
// SpecialTokenWithXYZ, reset the position to remove the last 3 characters and set
// the type to SpecialToken
SpecialTokenWithXYZ
: 'XYZ'
-> popMode
;
SpecialTokenCharacterSpanAtEOF
: ~'X'+ EOF
-> type(SpecialToken), popMode
;
SpecialTokenCharacterSpan
: ~'X'+
-> more
;
SpecialTokenXAtEOF
: 'X' EOF
-> type(SpecialToken), popMode
;
SpecialTokenX
: 'X'
-> more
;
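To make the "reset the position" step more concrete, here is a minimal plain-Java sketch of the idea. It does not use the ANTLR runtime. The class XyzRewindSketch, its Result type, and the scan method are hypothetical names invented for illustration, and the loop only models what the mode rules above do: consume characters one at a time (the "more" rules), accept once XYZ has been matched, then move the end position back by three characters so that XYZ is left in the input for the next token.

```java
final class XyzRewindSketch {
    static final String TERMINATOR = "XYZ";

    /** Outcome of one scan: token text, offset where lexing resumes, and whether XYZ was seen. */
    static final class Result {
        final String text;
        final int resumeAt;
        final boolean sawTerminator;

        Result(String text, int resumeAt, boolean sawTerminator) {
            this.text = text;
            this.resumeAt = resumeAt;
            this.sawTerminator = sawTerminator;
        }
    }

    /** Scans from start, modelling the SpecialTokenMode rules plus the position adjustment. */
    static Result scan(String input, int start) {
        int pos = start;
        while (pos < input.length()) {
            if (input.startsWith(TERMINATOR, pos)) {
                // Models SpecialTokenWithXYZ: the rule matches through the terminator...
                int acceptEnd = pos + TERMINATOR.length();
                // ...and the position-adjusting step removes the last 3 characters again.
                int adjustedEnd = acceptEnd - TERMINATOR.length();
                return new Result(input.substring(start, adjustedEnd), adjustedEnd, true);
            }
            pos++; // Models the "-> more" rules: keep accumulating characters.
        }
        // Models the ...AtEOF rules: without XYZ, the rest of the input is the token.
        return new Result(input.substring(start), input.length(), false);
    }

    public static void main(String[] args) {
        Result r = scan("helloXYZ", 0);
        System.out.println("[" + r.text + "] resume at " + r.resumeAt);
    }
}
```

Note that this sketch keeps any whitespace in front of XYZ, so the second sample input yields the text "Blah Blah" followed by a space. Trimming it, if wanted, would be a separate step.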
Upvotes: 2
Reputation: 13416
Using the hint that Terrance gives in his answer, I think this is what Roger is looking for:
grammar UseLookahead;
parserRule : LexerRule;
LexerRule : .+? { (_input.LA(1) == 'X') &&
(_input.LA(2) == 'Y') &&
(_input.LA(3) == 'Z')
}?
;
This gives the answers required, hello and Blah Blah respectively. I confess that I don't understand the significance of the final ?.
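On the final ?: in ANTLR 4 a braced block followed by ? is a semantic predicate, whereas a braced block without it is an ordinary action. The plain-Java sketch below is only a simplified model of how such a predicate gets consulted after each consumed character, not ANTLR's actual matching algorithm. The class PredicateCostSketch, the evaluations counter, and the match method are hypothetical names introduced for illustration.

```java
final class PredicateCostSketch {
    /** Number of predicate evaluations performed by the most recent call to match(). */
    static int evaluations = 0;

    /** Stand-in for the grammar's predicate: are the next three characters X, Y, Z? */
    static boolean nextIsXyz(String input, int pos) {
        evaluations++;
        return input.startsWith("XYZ", pos);
    }

    /**
     * Models the non-greedy ".+?" followed by the predicate: consume at least one
     * character, then stop at the first position where the predicate holds.
     * Returns null when XYZ never follows a non-empty prefix.
     */
    static String match(String input) {
        evaluations = 0;
        for (int pos = 1; pos <= input.length(); pos++) {
            if (nextIsXyz(input, pos)) {
                return input.substring(0, pos);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String token = match("helloXYZ");
        System.out.println("[" + token + "] after " + evaluations + " predicate checks");
    }
}
```

In this model the predicate runs once per consumed character, which is the cost the other answer warns about for long inputs.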
Upvotes: 2