dposada

Reputation: 899

ANTLR4 - How to tokenize differently inside quotes?

I am defining an ANTLR4 grammar and I'd like it to tokenize certain - but not all - things differently when they appear inside double-quotes than when they appear outside double-quotes. Here's the grammar I have so far:

grammar SimpleGrammar;

AND: '&';
TERM: TERM_CHAR+;
PHRASE_TERM: (TERM_CHAR | '%' | '&' | ':' | '$')+;
TRUNCATION: TERM '!';
WS: WS_CHAR+ -> skip;

fragment TERM_CHAR: 'a' .. 'z' | 'A' .. 'Z';
fragment WS_CHAR: [ \t\r\n];

// Parser rules
expr:
    expr AND expr
    | '"' phrase '"'
    | TERM
    | TRUNCATION
    ;

phrase:
    (TERM | PHRASE_TERM | TRUNCATION)+
    ;

The above grammar works when parsing a! & b, which correctly parses to:

  AND
  / \
 /   \
a!    b

However, when I attempt to parse "a! & b", I get:

line 1:4 extraneous input '&' expecting {'"', TERM, PHRASE_TERM, TRUNCATION}

The error message makes sense, because the & is getting tokenized as AND. What I would like to do, however, is have the & get tokenized as a PHRASE_TERM when it appears inside of double-quotes (inside a "phrase"). Note, I do want the a! to tokenize as TRUNCATION even when it appears inside the phrase.

Is this possible?

Upvotes: 2

Views: 300

Answers (1)

Divisadero

Reputation: 913

Yes, this is possible with lexer modes. A lexer rule can switch the lexer into a different mode when it matches a specific token. Modes are only allowed in a lexer grammar, so the lexer rules must be moved out of the combined grammar into a separate lexer grammar.

In your case, switch to a second mode when the opening quote is matched, and switch back to the default mode when the closing quote is matched. The mode-switching commands look like this (shown here for brackets rather than quotes):

LBRACK : '[' -> pushMode(CharSet);
RBRACK : ']' -> popMode;
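
Applied to the grammar in the question, this could look roughly like the sketch below. It is untested, and the names SimpleLexer, SimpleParser, QUOTE, PHRASE_MODE, PHRASE_QUOTE, PHRASE_TRUNCATION and PHRASE_PLAIN_TERM are my own choices, not anything ANTLR requires. The lexer goes in its own file (assumed here to be SimpleLexer.g4):

```antlr
lexer grammar SimpleLexer;

// Default mode: tokenization outside double-quotes.
AND        : '&';
QUOTE      : '"' -> pushMode(PHRASE_MODE);   // opening quote enters phrase mode
TRUNCATION : TERM '!';
TERM       : TERM_CHAR+;
WS         : WS_CHAR+ -> skip;

fragment TERM_CHAR : 'a' .. 'z' | 'A' .. 'Z';
fragment WS_CHAR   : [ \t\r\n];

// Phrase mode: tokenization inside double-quotes.
mode PHRASE_MODE;

// Closing quote is reported as QUOTE and returns to the default mode.
PHRASE_QUOTE      : '"' -> type(QUOTE), popMode;
// Token names must be unique across modes, so type(...) maps these
// rules back onto the token types the parser already uses.
PHRASE_TRUNCATION : TERM_CHAR+ '!' -> type(TRUNCATION);
PHRASE_PLAIN_TERM : TERM_CHAR+ -> type(TERM);
// '&' is no longer AND here; it becomes part of a PHRASE_TERM.
PHRASE_TERM       : (TERM_CHAR | '%' | '&' | ':' | '$')+;
PHRASE_WS         : WS_CHAR+ -> skip;
```

The parser then goes in a second file (assumed SimpleParser.g4) and refers to the quote by token name instead of the '"' literal:

```antlr
parser grammar SimpleParser;

options { tokenVocab = SimpleLexer; }

expr
    : expr AND expr
    | QUOTE phrase QUOTE
    | TERM
    | TRUNCATION
    ;

phrase
    : (TERM | PHRASE_TERM | TRUNCATION)+
    ;
```

With this arrangement, "a! & b" should tokenize as QUOTE, TRUNCATION, PHRASE_TERM, TERM, QUOTE, so the & inside the phrase is no longer seen as AND, while a! is still a TRUNCATION.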

For more information, search for "ANTLR lexer modes".

Upvotes: 2
