Reputation: 482
I'm trying to parse expressions from the Jakarta Expression Language. In summary, it is a simplified Java expressions, with addition of a few things:
{"foo": "bar"}
[1,2,3,4]
{1,2,3,4}
foo gt bar
(foo > bar
), foo mod bar
(foo % bar)
, and so on.I'm struggling in the last bit, where it always understands the "mod", "gt", "ge" as identifiers instead of using the expression that has the "%", ">", ">=".
I'm new to ANTLR. My grammar is based on the Java grammar in the https://github.com/antlr/grammars-v4/tree/master/java/java and the JavaCC provided by: https://jakarta.ee/specifications/expression-language/4.0/jakarta-expression-language-spec-4.0.html#collected-syntax
grammar ExpressionLanguageGrammar;
prog: compositeExpression;
compositeExpression: (dynamicExpression | deferredExpression | literalExpression)*;
dynamicExpression: '${' expression RCURL;
deferredExpression: '#{' expression RCURL;
literalExpression: literal;
literal: BOOL_LITERAL | FLOATING_POINT_LITERAL | INTEGER_LITERAL | StringLiteral | NULL;
mapData | listData | setData;
methodArguments: LPAREN expressionList? RPAREN;
expressionList: (expression ((COMMA expression)*));
lambdaExpressionOrCall: LPAREN lambdaExpression RPAREN methodArguments*;
lambdaExpression: lambdaParameters ARROW expression;
lambdaParameters: IDENTIFIER | (LPAREN (IDENTIFIER ((COMMA IDENTIFIER)*))? RPAREN);
mapEntry: expression COLON expression;
mapEntries: mapEntry (COMMA mapEntry)*;
expression
: primary
|'[' expressionList? ']'
| '{' expressionList? '}'
| '{' mapEntries? '}'
| expression bop='.' (IDENTIFIER | IDENTIFIER '(' expressionList? ')')
| expression ('[' expression ']')+
| prefix=('-' | '!' | NOT1 | EMPTY) expression
| expression bop=('*' | '/' | '%' | MOD1 | DIV1) expression
| expression bop=('+' | '-') expression
| expression bop=('<=' | '>=' | '>' | '<' | LE1 | GE1 | LT1 | GT1) expression
| expression bop=INSTANCEOF IDENTIFIER
| expression bop=('==' | '!=' | EQ1 | NE1) expression
| expression bop=('&&' | AND1) expression
| expression bop=('||' | OR1) expression
| <assoc=right> expression bop='?' expression bop=':' expression
| <assoc=right> expression
bop=('=' | '+=' | '-=' | '*=' | '/=')
expression
| lambdaExpression
| lambdaExpressionOrCall
;
primary
: '(' expression ')'
| literal
| IDENTIFIER
;
BOOL_LITERAL: TRUE | FALSE;
IDENTIFIER: LETTER (LETTER|DIGIT)*;
INTEGER_LITERAL: [0-9]+;
FLOATING_POINT_LITERAL: [0-9]+ '.' [0-9]* EXPONENT? | '.' [0-9]+ EXPONENT? | [0-9]+ EXPONENT?;
fragment EXPONENT: ('e'|'E') ('+'|'-')? [0-9]+;
StringLiteral: ('"' DoubleStringCharacter* '"'
| '\'' SingleStringCharacter* '\'') ;
fragment DoubleStringCharacter
: ~["\\\r\n]
| '\\' EscapeSequence
;
fragment SingleStringCharacter
: ~['\\\r\n]
| '\\' EscapeSequence
;
fragment EscapeSequence
: CharacterEscapeSequence
| '0'
| HexEscapeSequence
| UnicodeEscapeSequence
| ExtendedUnicodeEscapeSequence
;
fragment CharacterEscapeSequence
: SingleEscapeCharacter
| NonEscapeCharacter
;
fragment HexEscapeSequence
: 'x' HexDigit HexDigit
;
fragment UnicodeEscapeSequence
: 'u' HexDigit HexDigit HexDigit HexDigit
| 'u' '{' HexDigit HexDigit+ '}'
;
fragment ExtendedUnicodeEscapeSequence
: 'u' '{' HexDigit+ '}'
;
fragment SingleEscapeCharacter
: ['"\\bfnrtv]
;
fragment NonEscapeCharacter
: ~['"\\bfnrtv0-9xu\r\n]
;
fragment EscapeCharacter
: SingleEscapeCharacter
| [0-9]
| [xu]
;
fragment HexDigit
: [_0-9a-fA-F]
;
fragment DecimalIntegerLiteral
: '0'
| [1-9] [0-9_]*
;
fragment ExponentPart
: [eE] [+-]? [0-9_]+
;
fragment IdentifierPart
: IdentifierStart
| [\p{Mn}]
| [\p{Nd}]
| [\p{Pc}]
| '\u200C'
| '\u200D'
;
fragment IdentifierStart
: [\p{L}]
| [$_]
| '\\' UnicodeEscapeSequence
;
LCURL: '{';
RCURL: '}';
LETTER: '\u0024' |
'\u0041'..'\u005a' |
'\u005f' |
'\u0061'..'\u007a' |
'\u00c0'..'\u00d6' |
'\u00d8'..'\u00f6' |
'\u00f8'..'\u00ff' |
'\u0100'..'\u1fff' |
'\u3040'..'\u318f' |
'\u3300'..'\u337f' |
'\u3400'..'\u3d2d' |
'\u4e00'..'\u9fff' |
'\uf900'..'\ufaff';
DIGIT: '\u0030'..'\u0039'|
'\u0660'..'\u0669'|
'\u06f0'..'\u06f9'|
'\u0966'..'\u096f'|
'\u09e6'..'\u09ef'|
'\u0a66'..'\u0a6f'|
'\u0ae6'..'\u0aef'|
'\u0b66'..'\u0b6f'|
'\u0be7'..'\u0bef'|
'\u0c66'..'\u0c6f'|
'\u0ce6'..'\u0cef'|
'\u0d66'..'\u0d6f'|
'\u0e50'..'\u0e59'|
'\u0ed0'..'\u0ed9'|
'\u1040'..'\u1049';
TRUE: 'true';
FALSE: 'false';
NULL: 'null';
DOT: '.';
LPAREN: '(';
RPAREN: ')';
LBRACK: '[';
RBRACK: ']';
COLON: ':';
COMMA: ',';
SEMICOLON: ';';
GT0: '>';
GT1: 'gt';
LT0: '<';
LT1: 'lt';
GE0: '>=';
GE1: 'ge';
LE0: '<=';
LE1: 'le';
EQ0: '==';
EQ1: 'eq';
NE0: '!=';
NE1: 'ne';
NOT0: '!';
NOT1: 'not';
AND0: '&&';
AND1: 'and';
OR0: '||';
OR1: 'or';
EMPTY: 'empty';
INSTANCEOF: 'instanceof';
MULT: '*';
PLUS: '+';
MINUS: '-';
QUESTIONMARK: '?';
DIV0: '/';
DIV1: 'div';
MOD0: '%';
MOD1: 'mod';
CONCAT: '+=';
ASSIGN: '=';
ARROW: '->';
DOLLAR: '$';
HASH: '#';
WS: [ \t\r\n]+ -> skip;
Upvotes: 0
Views: 64
Reputation: 6785
Move the Lexer rules for them to be prior to the Lexer rule for Identifier
.
If ANTLR has more than one Lexer rule that matches input of the same length it chooses the first rule in the grammar that matches.
For example “mod” is matched by Identifier
and MOD1
, but Identifier
is 1st, so it chooses Identifier
. Move the MOD1
rule to be before Identifier
and it’ll match MOD1
———-
BTW, unless you care about having different token values for “%” and “mod”, you can just define a single rule:
MOD: ‘%’ | ‘mod’;
You’d can still get the token text if you need it but it will you can just specify MOD
in your parser rules instead of (MOD0 | MOD1)
Upvotes: 1