zurk
zurk

Reputation: 103

ANTLR : Tree parsing giving unexpected results

I'm trying to create my own ANTLR4 grammar so that I can parse some expressions, such as :

STARTS_WITH(this.sequence, "startSeq")

ENDS_WITH("sequenceToTest", "endSeq")

(I have a mix of constants and variables as function parameters).

The Lexer and Parser Rules have been created, but when I try to display the function ToStringTree, it seems like the tree is parsed in a wrong way.

I am using Antlr 4 with ASP.Net Core 2.1.

The grammar file :

grammar expression;

expression : bool_function (OPERATOR bool_function)* ;

bool_function : (FUNCTION_NAME PAR_OPEN parameter (COMMA parameter)* PAR_CLOSE) ;

parameter : constant | current_object_field ;

constant : DOUBLE_QUOTE ANY_BUT_DOUBLE_QUOTE DOUBLE_QUOTE ;

current_object_field : THIS ALPHANUM ;

WHITESPACE : (' ' | '\t' | '\n' |'\r' )+ -> skip ;

COMPARATOR : ('==' | '<=' | '>=' | '<>') ;

OPERATOR : ('&&' | '||') ;

PAR_OPEN : '(' ;

PAR_CLOSE : ')' ;

COMMA : ',' ;

DOLLAR : '$' ;

DOUBLE_QUOTE : '"' ;

THIS : 'this.' ;

NUMBER : [0-9]+ ;

ALPHANUM : [a-zA-Z_0-9]+ ;

ANY_BUT_DOUBLE_QUOTE    : ~('"')+ ;

FUNCTION_NAME : ('STARTS_WITH' | 'ENDS_WITH' | 'CONTAINS' | 'EQUALS') ;

The unit testing, trying to parse one basic expression :

        try
        {
            expressionParser expressionParser = Setup("STARTS_WITH(this.sequence, \"startSeq\")");
            expressionParser.ExpressionContext expressionContext = expressionParser.expression();

            var a = expressionContext.ToStringTree(expressionParser);
        }
        catch (Exception e)
        {
            var a = e.Message;
            throw;
        }

The output I receive for parsing my expression via ToStringTree is the following :

(expression (bool_function STARTS_WITH(this.sequence,  " startSeq " )))

But I would expect a result, going deeper, such as :

(expression (bool_function STARTS_WITH((parameter(current_objet_field this.sequence)),(parameter(constant "startSeq")))))

Are there any obvious mistakes in the way I defined my Lexer/Parser ?

Upvotes: 0

Views: 497

Answers (1)

sepp2k
sepp2k

Reputation: 370212

You don't get the tree you want because your input isn't parsed correctly at all. The parse fails with the following syntax error:

line 1:0 mismatched input 'STARTS_WITH(this.sequence, ' expecting FUNCTION_NAME

So the first thing you should check, is why you didn't see the error message. The default error listener will print syntax errors to stderr. If you're running in an environment where you don't see stderr, you should install your own error listener that informs the user of syntax errors in the input in a way that will be noticeable.

Now why does the input not parse? Well, the error message seems suspicious because it's complaining about a missing FUNCTION_NAME at the beginning when that's exactly what STARTS_WITH is (or should be, at least). It also seems to treat STARTS_WITH(this.sequence, as a single token, which is clearly not what we want. So something seems to be amiss with your lexer rules.

The first thing you should do when you think the lexer might be producing the wrong tokens, is to actually print out the tokens produced by the lexer. You can do that using grun with the -tokens option (that requires you to go through Java, but that's not much of a problem since your grammar contains no actions) or by iterating over the token stream in your C# code (note that you'll have to reset the stream after iterating over it or else the parser will only see an empty stream).

By doing this, we'll see that the following tokens have been produced by the lexer:

[@0,0:26='STARTS_WITH(this.sequence, ',<ANY_BUT_DOUBLE_QUOTE>,1:0]
[@1,27:27='"',<'"'>,1:27]
[@2,28:35='startSeq',<ALPHANUM>,1:28]
[@3,36:36='"',<'"'>,1:36]
[@4,37:37=')',<')'>,1:37]
[@5,38:37='<EOF>',<EOF>,1:38]

Now the first problem we can see here is the ANY_BUT_DOUBLE_QUOTE token in the beginning. Clearly we want this to be multiple tokens, none of which are ANY_BUT_DOBULE_QUOTE. This happens because ANY_BUT_DOUBLE_QUOTE can match the entire string STARTS_WITH(this.sequence, whereas FUNCTION_NAME would only match STARTS_WITH. ANTLR-generated lexers follow the maximum munch rule, which says to always use the rule that produces the longest match (using the one that comes first in the grammar in case of ties).

The other problem is startSeq being an ALPHANUM since the only thing that your grammar allows between two double quotes is ANY_BUT_DOUBLE_QUOTE. Here the lexer produced an ALPHANUM instead of ANY_BUT_DOUBLE_QUOTE because both rules would have produced a match of the same length, but ALPHANUM comes first in the grammar. Note that if you simply switch around the order of ALPHANUM and ANY_BUT_DOUBLE_QUOTE, the lexer will never produce an ALPHANUM token, which is also not what you want.

Both of these issues stem from the fact that ANY_BUT_DOUBLE_QUOTE can match basically anything and thus overlaps with most of your other rules. This is a bad thing.

What you should be doing instead is to have a single lexer rule for string literals (so turn constant into a lexer rule and turn DOUBLE_QUOTE and ANY_BUT_DOUBLE_QUOTE into fragments or inline them directly into CONSTANT). That way there's no longer an ANY_BUT_DOUBLE_QUOTE rule that conflicts with everything and CONSTANT won't conflict with anything because it's the only rule that starts with a double quotes. This will also prevent white space from being discarded inside double quotes.

Once you've done that, you'll get an error about STARTS_WITH being an ALPHANUM instead of a FUNCTION_NAME, but you can fix that by moving FUNCTION_NAME before ALPHANUM in the grammar. Note that means that your function names can never be used as the names of members. If you don't want that, you should not make function names keywords (i.e. you should just allow arbitrary identifiers as function names and then check later whether you know a function with that name) or make them contextual keywords (have a parser rule that can match either an ALPHANUM or any of the function names).

Upvotes: 1

Related Questions