Issue with syntax validations for certain scenarios with grammar

Question

I've written a basic grammar for creating simple expressions (add, mul, div, mod, min, and using functions with some arguments).

Grammar:

grammar ExpressionGrammar;

parse: expr EOF;

expr:
    MIN expr
    | expr ( MUL | DIV) expr
    | expr ( ADD | MIN) expr
    | expr  ( MOD )  expr
    | NUM
    | ID
    | STRING
    | function
    | '(' expr ')';

function: ID '(' arguments? ')';

arguments: expr ( ',' expr)*;

/* Tokens */

MUL: '*';
DIV: '/';
MIN: '-';
ADD: '+';
MOD: '%';
OPEN_PAR: '(';
CLOSE_PAR: ')';

NUM: ([0-9]*[.])?[0-9]+;
STRING: '"' ~ ["]* '"';
fragment ID_NODE: [a-zA-Z_$][a-zA-Z0-9_$]*;
ID: ID_NODE ('.' ID_NODE)*;
COMMENT: '/*' .*? '*/' -> skip;
LINE_COMMENT
    : '//' ~[\r\n]* -> skip
    ;
WS: [ \t\r\n]+ -> skip;

Example valid expressions:

1. 1+2    -- simple add, multiply, divide and mod operations
2. "adadfasdf"   -- simply return a string
3. 233.4234234    -- return a number
4. field.value1    -- field is a fieldType on which value exists (a field with multiple values)
5. SUBSTR("asdfasdf","adsfasdff","asd");   -- a built-in function (such as SUBSTR, ADD, MIN, etc.) which takes arguments
6. ABS(field.value1)
7. ADD( field.value1,field.value2)
8. MAX( field.value1,field.value3, 2)
9. field.value1 % field.value2

10. It can even span multiple lines:

 ADD(
    field.value1,
    2,
    ABS(field.value2)   -- can have another function inside this function
    )

ANTLR throws syntax errors for some invalid input, but the scenarios below are not reported as errors.

It allows multiple closing brackets: trim(fields.value1))))))))
It allows multiple expressions without any operator between them (no error is thrown when two fields are used with no operator between them):

field.v1 field.v2

It also does not throw an error when I have two different expressions on separate lines (this could be an extension of the above scenario).
```
 field.v1 + 2
 field.v2 + 3
```
It does not show any error when there are multiple closing quotation marks: "adfadsfad""""""

Could all of the above scenarios be related to a single issue with my grammar? I'm not sure.

Could anyone help me correct my grammar to catch these errors?

Mike Cargal · Accepted Answer

These are classic symptoms of not having an EOF in your invoked rule.

If you don’t include EOF, then ANTLR4 will parse all matching input and quit (before the end of the input stream, if nothing else matches).

In each of your examples, if you look at it as “there’s something valid, followed by invalid input”, you’ll see that they all fit.

This is why @sepp2k asked about which rule you invoked.

It’s pretty clear that you likely invoked the expr rule rather than the parse rule (since the parse rule does have the EOF token included, it would continue to parse until it sees an EOF, and would report errors along the way).

When you generate ANTLR code, it generates methods/functions that let you start at any rule you want. You will have written some code that creates an input stream, then creates a Lexer from that stream, then creates a Parser from the TokenStream you get from the Lexer. Finally, once you have the parser object, you choose which parser rule to invoke.

Change that rule from expr() to parse() and you should start getting the errors you expect.

The EOF in a parser rule, basically, says the entire input must match this rule (i.e. it has to also consume the “END OF FILE” token). Without that token, the rule can (and will) match anything that matches the rule and stop when it can no longer find valid input.
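To make the difference concrete, here is a minimal sketch in plain Python. It is not ANTLR and not the code ANTLR generates; it is a hand-written recognizer for a small subset of the grammar (no strings, no precedence), and the names `MiniParser`, `parse_expr`, and `accepts` are made up for illustration. `parse_expr` mimics invoking `expr` directly, while `parse` mimics `parse: expr EOF;`.

```python
import re

# Illustrative tokenizer: numbers, dotted identifiers, operators, parens, commas.
# Characters that match nothing (e.g. whitespace) are simply skipped in this sketch.
TOKEN_RE = re.compile(r"\d+(?:\.\d+)?|[A-Za-z_$][A-Za-z0-9_$.]*|[-+*/%(),]")
OPERATORS = {"+", "-", "*", "/", "%"}
EOF = "<EOF>"


def tokenize(text):
    """Split text into tokens and append an explicit EOF marker."""
    return TOKEN_RE.findall(text) + [EOF]


class MiniParser:
    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos]

    def take(self, expected=None):
        tok = self.tokens[self.pos]
        if expected is not None and tok != expected:
            raise SyntaxError(f"expected {expected!r} but found {tok!r}")
        self.pos += 1
        return tok

    def expr(self):
        # expr: atom (operator atom)*   (operator precedence is ignored here)
        self.atom()
        while self.peek() in OPERATORS:
            self.take()
            self.atom()

    def atom(self):
        tok = self.peek()
        if tok == "-":                      # unary minus
            self.take()
            self.atom()
        elif tok == "(":                    # parenthesized expression
            self.take("(")
            self.expr()
            self.take(")")
        elif tok[0].isdigit():              # number
            self.take()
        elif tok[0].isalpha() or tok[0] in "_$":
            self.take()                     # identifier, maybe a function call
            if self.peek() == "(":
                self.take("(")
                if self.peek() != ")":
                    self.expr()
                    while self.peek() == ",":
                        self.take()
                        self.expr()
                self.take(")")
        else:
            raise SyntaxError(f"unexpected token {tok!r}")

    def parse_expr(self):
        """Like invoking `expr`: match what you can, then stop."""
        self.expr()

    def parse(self):
        """Like invoking `parse: expr EOF;`: the whole input must match."""
        self.expr()
        self.take(EOF)


def accepts(text, with_eof):
    parser = MiniParser(text)
    try:
        if with_eof:
            parser.parse()
        else:
            parser.parse_expr()
        return True
    except SyntaxError:
        return False


for sample in ["field.v1 + 2", "field.v1 field.v2", "trim(fields.value1))))"]:
    print(sample, "| expr:", accepts(sample, False), "| parse:", accepts(sample, True))
```

In this sketch the two inputs with trailing junk are accepted by the entry point without EOF and rejected by the one with EOF, which is the same pattern as in the question: something valid, followed by input nobody asked the parser to look at.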

It’s pretty common to see only a single “top-level” or “entry” rule that ends with EOF. And that’s generally a clue that it’s the parser rule you call to kick things off.

A couple of exceptions:

Perhaps you WANT to parse only whatever you can match and ignore the rest. Then use a rule without a terminal EOF. (This is probably rather rare.)
Perhaps you have situations where you know your input should only be an expr, but your top-level rule is something like stmts or compilationUnit. In this situation, you can create an exprOnly: expr EOF; rule and then call that rule to parse an expression; it will fail unless the entire input matches your syntax for an expr. You’d not have any other rules that reference the exprOnly rule, effectively making it a top-level or entry rule. There’s nothing to prevent you from having multiple top-level rules to choose from.
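The idea of several entry rules sharing one inner rule can be sketched the same way. This is again a hand-written Python illustration rather than ANTLR output, with a deliberately tiny `expr` (numbers joined by `+`) and an assumed `stmts: (expr ';')* EOF` shape that is not taken from any real grammar. The point is only that `expr` itself never mentions EOF, while each entry point does.

```python
import re

EOF = "<EOF>"


def tokenize(text):
    # Numbers, '+', and ';' only; an explicit end marker is appended.
    return re.findall(r"\d+|[+;]", text) + [EOF]


class Parser:
    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0

    def expect(self, predicate, what):
        tok = self.tokens[self.pos]
        if not predicate(tok):
            raise SyntaxError(f"expected {what}, found {tok!r}")
        self.pos += 1

    def expr(self):
        # expr: NUM ('+' NUM)*   -- shared inner rule, no EOF here
        self.expect(str.isdigit, "a number")
        while self.tokens[self.pos] == "+":
            self.pos += 1
            self.expect(str.isdigit, "a number")

    def stmts(self):
        # stmts: (expr ';')* EOF   -- entry rule for a whole program
        while self.tokens[self.pos] != EOF:
            self.expr()
            self.expect(lambda t: t == ";", "';'")

    def expr_only(self):
        # exprOnly: expr EOF   -- entry rule for a single expression
        self.expr()
        self.expect(lambda t: t == EOF, "end of input")


def ok(entry, text):
    """Run the named entry rule and report whether the input was accepted."""
    try:
        getattr(Parser(text), entry)()
        return True
    except SyntaxError:
        return False


print(ok("stmts", "1+2; 3;"), ok("expr_only", "1+2"), ok("expr_only", "1+2; 3;"))
```

Neither entry rule is referenced by another rule, so each one can safely demand that the input ends where its match ends.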

A rule that ends with EOF but is also referenced by another rule is likely (almost certainly) a mistake.

Note: no doubt, you will find grammars where the top-level rule does not end with an EOF token. This is considered bad practice, for precisely the reason you’ve encountered: the parse can succeed without consuming all the input.
