Is it possible to split a token into 2 in Antlr4?

Question

I need to split one token into two for highlighting purposes. I have a token that looks like this:

ID_INTERP: '$' IDEN;

but I want to highlight the dollar sign differently from the identifier. Is it possible to split this token into two, one with the dollar sign and the other with the identifier? I know I can change the entire token into a different type under certain conditions, but I'd also like to be able to add tokens and change the text they contain. Basically, I want to change the token stream so that instead of

ID_INTERP["$foo"]

it would see something like this:

DOLLAR_SIGN["$"] IDEN["foo"]

Mike Lischke · Accepted Answer

It is possible by extending your token source to emit more than a single token for a given match. I have used this idea to generate 2 tokens for the lexer rule DOT_IDENTIFIER (see the MySQL grammar in the MySQL Workbench parser). On match it pushes a dot token and sets the result to IDENTIFIER, effectively creating 2 separate tokens for a single rule.

Sam Harwell described the technique to extend your lexer for this approach in his answer, with some Java code. Here is a possible C++ implementation that I'm using:

std::unique_ptr<antlr4::Token> MySQLBaseLexer::nextToken() {
  // First respond with pending tokens to the next token request, if there are any.
  if (!_pendingTokens.empty()) {
    auto pending = std::move(_pendingTokens.front());
    _pendingTokens.pop_front();
    return pending;
  }

  // Let the main lexer class run the next token recognition.
  // This might create additional tokens again.
  auto next = Lexer::nextToken();
  if (!_pendingTokens.empty()) {
    auto pending = std::move(_pendingTokens.front());
    _pendingTokens.pop_front();
    _pendingTokens.push_back(std::move(next));
    return pending;
  }
  return next;
}
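
The override above only shows the consuming side of the queue. To illustrate how the producing side fits in, here is a minimal standalone sketch that does not use the ANTLR runtime at all. The names Token, TokenType, SplittingLexer, matchOne and tokenize are hypothetical and exist only for this example; matchOne() stands in for a generated lexer rule plus its action, which in a real grammar would queue the extra token from an action on the ID_INTERP rule.

```cpp
#include <cctype>
#include <cstddef>
#include <deque>
#include <string>
#include <utility>
#include <vector>

// Hypothetical token types, named after the question's example.
enum class TokenType { DollarSign, Iden, Other, Eof };

struct Token {
  TokenType type;
  std::string text;
};

// Toy stand-in for a generated lexer. This is NOT the ANTLR runtime API.
class SplittingLexer {
public:
  explicit SplittingLexer(std::string input) : _input(std::move(input)) {}

  Token nextToken() {
    // Serve tokens queued by an earlier match first.
    if (!_pendingTokens.empty())
      return popPending();

    Token next = matchOne();
    // matchOne() may have queued tokens that belong before `next`.
    if (!_pendingTokens.empty()) {
      Token pending = popPending();
      _pendingTokens.push_back(std::move(next));
      return pending;
    }
    return next;
  }

private:
  Token popPending() {
    Token t = std::move(_pendingTokens.front());
    _pendingTokens.pop_front();
    return t;
  }

  static bool isIdenChar(char c) {
    return std::isalnum(static_cast<unsigned char>(c)) != 0 || c == '_';
  }

  // Plays the role of a lexer rule together with its action.
  Token matchOne() {
    while (_pos < _input.size() &&
           std::isspace(static_cast<unsigned char>(_input[_pos])) != 0)
      ++_pos;
    if (_pos >= _input.size())
      return {TokenType::Eof, ""};

    std::size_t start = _pos;
    if (_input[_pos] == '$' && _pos + 1 < _input.size() &&
        isIdenChar(_input[_pos + 1])) {
      ++_pos; // skip '$'
      std::size_t idStart = _pos;
      while (_pos < _input.size() && isIdenChar(_input[_pos]))
        ++_pos;
      // "Action": queue the '$' token and return the rest as IDEN.
      _pendingTokens.push_back({TokenType::DollarSign, "$"});
      return {TokenType::Iden, _input.substr(idStart, _pos - idStart)};
    }
    if (isIdenChar(_input[_pos])) {
      while (_pos < _input.size() && isIdenChar(_input[_pos]))
        ++_pos;
      return {TokenType::Iden, _input.substr(start, _pos - start)};
    }
    ++_pos; // any other single character
    return {TokenType::Other, _input.substr(start, 1)};
  }

  std::string _input;
  std::size_t _pos = 0;
  std::deque<Token> _pendingTokens;
};

// Collects all tokens up to, but not including, end of input.
std::vector<Token> tokenize(const std::string &input) {
  SplittingLexer lexer(input);
  std::vector<Token> tokens;
  for (Token t = lexer.nextToken(); t.type != TokenType::Eof;
       t = lexer.nextToken())
    tokens.push_back(std::move(t));
  return tokens;
}
```

Calling tokenize("$foo") on this toy lexer yields a DollarSign token with text "$" followed by an Iden token with text "foo". The queued token is returned first and the retyped match is appended to the queue, which is the same ordering the nextToken() override above relies on.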
