Reputation: 103
I'm trying to create my own ANTLR4 grammar so that I can parse some expressions, such as :
STARTS_WITH(this.sequence, "startSeq")
ENDS_WITH("sequenceToTest", "endSeq")
(I have a mix of constants and variables as function parameters).
The Lexer and Parser Rules have been created, but when I try to display the function ToStringTree, it seems like the tree is parsed in a wrong way.
I am using Antlr 4 with ASP.Net Core 2.1.
The grammar file :
grammar expression;
expression : bool_function (OPERATOR bool_function)* ;
bool_function : (FUNCTION_NAME PAR_OPEN parameter (COMMA parameter)* PAR_CLOSE) ;
parameter : constant | current_object_field ;
constant : DOUBLE_QUOTE ANY_BUT_DOUBLE_QUOTE DOUBLE_QUOTE ;
current_object_field : THIS ALPHANUM ;
WHITESPACE : (' ' | '\t' | '\n' |'\r' )+ -> skip ;
COMPARATOR : ('==' | '<=' | '>=' | '<>') ;
OPERATOR : ('&&' | '||') ;
PAR_OPEN : '(' ;
PAR_CLOSE : ')' ;
COMMA : ',' ;
DOLLAR : '$' ;
DOUBLE_QUOTE : '"' ;
THIS : 'this.' ;
NUMBER : [0-9]+ ;
ALPHANUM : [a-zA-Z_0-9]+ ;
ANY_BUT_DOUBLE_QUOTE : ~('"')+ ;
FUNCTION_NAME : ('STARTS_WITH' | 'ENDS_WITH' | 'CONTAINS' | 'EQUALS') ;
The unit testing, trying to parse one basic expression :
try
{
expressionParser expressionParser = Setup("STARTS_WITH(this.sequence, \"startSeq\")");
expressionParser.ExpressionContext expressionContext = expressionParser.expression();
var a = expressionContext.ToStringTree(expressionParser);
}
catch (Exception e)
{
var a = e.Message;
throw;
}
The output I receive for parsing my expression via ToStringTree is the following :
(expression (bool_function STARTS_WITH(this.sequence, " startSeq " )))
But I would expect a result, going deeper, such as :
(expression (bool_function STARTS_WITH((parameter(current_objet_field this.sequence)),(parameter(constant "startSeq")))))
Are there any obvious mistakes in the way I defined my Lexer/Parser ?
Upvotes: 0
Views: 497
Reputation: 370212
You don't get the tree you want because your input isn't parsed correctly at all. The parse fails with the following syntax error:
line 1:0 mismatched input 'STARTS_WITH(this.sequence, ' expecting FUNCTION_NAME
So the first thing you should check, is why you didn't see the error message. The default error listener will print syntax errors to stderr. If you're running in an environment where you don't see stderr, you should install your own error listener that informs the user of syntax errors in the input in a way that will be noticeable.
Now why does the input not parse? Well, the error message seems suspicious because it's complaining about a missing FUNCTION_NAME
at the beginning when that's exactly what STARTS_WITH
is (or should be, at least). It also seems to treat STARTS_WITH(this.sequence,
as a single token, which is clearly not what we want. So something seems to be amiss with your lexer rules.
The first thing you should do when you think the lexer might be producing the wrong tokens, is to actually print out the tokens produced by the lexer. You can do that using grun
with the -tokens
option (that requires you to go through Java, but that's not much of a problem since your grammar contains no actions) or by iterating over the token stream in your C# code (note that you'll have to reset the stream after iterating over it or else the parser will only see an empty stream).
By doing this, we'll see that the following tokens have been produced by the lexer:
[@0,0:26='STARTS_WITH(this.sequence, ',<ANY_BUT_DOUBLE_QUOTE>,1:0]
[@1,27:27='"',<'"'>,1:27]
[@2,28:35='startSeq',<ALPHANUM>,1:28]
[@3,36:36='"',<'"'>,1:36]
[@4,37:37=')',<')'>,1:37]
[@5,38:37='<EOF>',<EOF>,1:38]
Now the first problem we can see here is the ANY_BUT_DOUBLE_QUOTE
token in the beginning. Clearly we want this to be multiple tokens, none of which are ANY_BUT_DOBULE_QUOTE
. This happens because ANY_BUT_DOUBLE_QUOTE
can match the entire string STARTS_WITH(this.sequence,
whereas FUNCTION_NAME
would only match STARTS_WITH
. ANTLR-generated lexers follow the maximum munch rule, which says to always use the rule that produces the longest match (using the one that comes first in the grammar in case of ties).
The other problem is startSeq
being an ALPHANUM
since the only thing that your grammar allows between two double quotes is ANY_BUT_DOUBLE_QUOTE
. Here the lexer produced an ALPHANUM
instead of ANY_BUT_DOUBLE_QUOTE
because both rules would have produced a match of the same length, but ALPHANUM
comes first in the grammar. Note that if you simply switch around the order of ALPHANUM
and ANY_BUT_DOUBLE_QUOTE
, the lexer will never produce an ALPHANUM
token, which is also not what you want.
Both of these issues stem from the fact that ANY_BUT_DOUBLE_QUOTE
can match basically anything and thus overlaps with most of your other rules. This is a bad thing.
What you should be doing instead is to have a single lexer rule for string literals (so turn constant
into a lexer rule and turn DOUBLE_QUOTE
and ANY_BUT_DOUBLE_QUOTE
into fragments or inline them directly into CONSTANT
). That way there's no longer an ANY_BUT_DOUBLE_QUOTE
rule that conflicts with everything and CONSTANT
won't conflict with anything because it's the only rule that starts with a double quotes. This will also prevent white space from being discarded inside double quotes.
Once you've done that, you'll get an error about STARTS_WITH
being an ALPHANUM
instead of a FUNCTION_NAME
, but you can fix that by moving FUNCTION_NAME
before ALPHANUM
in the grammar. Note that means that your function names can never be used as the names of members. If you don't want that, you should not make function names keywords (i.e. you should just allow arbitrary identifiers as function names and then check later whether you know a function with that name) or make them contextual keywords (have a parser rule that can match either an ALPHANUM
or any of the function names).
Upvotes: 1