Reputation: 29632

How do I correctly parse generic types with ANTLR?

I am using ANTLR v3 and I want to correctly parse template expressions like:

List<List<int>>

The problem is that the two closing greater than signs conflict with the shift right operator. I've seen a few solutions, but most of them look like workarounds, e.g. not having the shift right at all, but just the greater than. I would like to just have this parse correctly in the first place.

Upvotes: 4

Answers (2)

Pieter van Ginkel

Reputation: 29632

The standard solution to this problem is to not have a '>>' token at all. Instead, you only have the '>' token and in the parser you match two consecutive '>''s. Then, to make sure that you didn't match something else, you perform a check afterwards to verify that it really was a '>>'. See the answer by @280Z28 for an explanation on how to do it that way.

The solution below is an alternative approach. This approach allows you to actually have a '>>' token in your lexer. Basically, how it works is that the single '>>' token is emitted as two virtual tokens. These two tokens can be matched separately or together (just like the standard solution). The major difference is that it is not possible to have anything between these two tokens like, and thus not require you to verify whether you actually matched '>>' instead of something else.

First you have to adjust the lexer to support emitting multiple tokens. See How can I emit more than a single token per lexer rule? on how to accomplish this.

Next the lexer rules. We have two lexer rules. The lexer rule for the greater than token is just like you would normally do this:

OP_GREATER_THAN : '>' ;

The shift right is where the magic happens. Instead of emitting a shift right token, we emit multiple tokens:

OP_GREATER_THAN_GREATER_THAN : t='>>' { emitGreaterThanGreaterThan(t); } ;

The emitGreaterThanGreaterThan() method emits these:

private void emitGreaterThanGreaterThan(Token token) {
    // Split the greater than token into two tokens so they can be matched separately.

    CommonToken first = new CommonToken(token);
    first.setType(OP_GREATER_THAN_GREATER_THAN_FIRST);
    emit(first);

    CommonToken second = new CommonToken(token);
    second.setType(OP_GREATER_THAN_GREATER_THAN_SECOND);
    emit(second);
}

This method can be added to the @lexer::members section in the lexer. What this method does is that it takes the '>>' token, copies it two times and changes the types. It turns the single shift right token into a OP_GREATER_THAN_GREATER_THAN_FIRST and OP_GREATER_THAN_GREATER_THAN_SECOND. These tokens must be declared, so the following lexer rules must be added:

fragment OP_GREATER_THAN_GREATER_THAN_FIRST : ;
fragment OP_GREATER_THAN_GREATER_THAN_SECOND : ;

These lexer rules don't match anything, but are here just so that we can reference them.

To make using this construct a little bit easier, a few parsing rules can be added:

op_GREATER_THAN_GREATER_THAN
    :
        OP_GREATER_THAN_GREATER_THAN_FIRST
        OP_GREATER_THAN_GREATER_THAN_SECOND
    ;

op_GREATER_THAN_ANY
    : OP_GREATER_THAN
    | OP_GREATER_THAN_GREATER_THAN_FIRST
    | OP_GREATER_THAN_GREATER_THAN_SECOND
    ;

The parsing rule op_GREATER_THAN_GREATER_THAN matches the first token followed by the second. This becomes the shift right token. The op_GREATER_THAN_ANY rule matches any greater than token. This is used to parse the generic type.

The generic type rules looks as follows:

genericTypeArguments
    :
        OP_LESS_THAN
        genericTypeArgument
        (
            OP_COMMA
            genericTypeArgument
        )*
        op_GREATER_THAN_ANY
    ;

And that is where the magic happens. Because we've split up the '>>' token into multiple tokens, we can match either a bare greater than token, the first part of the shift right token and the second part of the shift right token.

Upvotes: 0

Sam Harwell

Reputation: 99879

In some cases, the input >> is two tokens (two right angle braces closing the generic type arguments), and in other cases it is a single operator. To handle these properly, always treat a single right angle brace as a single token in the lexer, and use parser rules to distinguish between the cases. One of the example Java grammars does exactly that. You can use code to verify that the shift operator does not contain any extraneous characters between the angle braces without affecting the portability of the grammar itself.

Lexer rules

GT : '>';
LT : '<';

Parser rules

typeArguments
    :   '<' typeArgument (',' typeArgument)* '>'
    ;

shiftOp
    :   '<' '<'
    |   '>' '>' '>'
    |   '>' '>'
    ;

Code (ANTLR 3)

I don't have a specific example here, since the implementation would depend on the intermediate format used by the grammar (i.e. output=AST or some other format). I strongly recommend all new development be done using ANTLR 4 for too many reasons to list.

Code (ANTLR 4)

Note: In ANTLR 4 this could be performed in a listener or visitor. I arbitrarily used a listener method here.

@Override
public void enterShiftOp(ShiftOpContext ctx) {
  Token token = ((TerminalNode)ctx.getChild(0)).getSymbol();
  int firstIndex = token.getStartIndex();
  for (int i = 1; i < ctx.getChildCount(); i++) {
    Token sibling = ((TerminalNode)ctx.getChild(i)).getSymbol();
    if (sibling.getStartIndex() != firstIndex + i) {
      // report error: spaces/comments cannot appear between chars of a shift operator
    }
  }
}