My grammar identifies keywords as identifiers

Question

I'm trying to parse expressions from the Jakarta Expression Language. In short, it is a simplified form of Java expressions, with a few additions:

Support for creating maps like: {"foo": "bar"}
Support for creating lists and sets like: [1,2,3,4] {1,2,3,4}
Keyword operators that can be used instead of symbols, like: foo gt bar (foo > bar), foo mod bar (foo % bar), and so on.

I'm struggling with the last part: "mod", "gt", "ge" are always recognized as identifiers, so the expression alternatives that handle "%", ">", ">=" are never used for them.

I'm new to ANTLR. My grammar is based on the Java grammar at https://github.com/antlr/grammars-v4/tree/master/java/java and on the JavaCC grammar provided in the specification: https://jakarta.ee/specifications/expression-language/4.0/jakarta-expression-language-spec-4.0.html#collected-syntax

grammar ExpressionLanguageGrammar;
prog: compositeExpression;

compositeExpression: (dynamicExpression | deferredExpression | literalExpression)*;
dynamicExpression: '${' expression RCURL;
deferredExpression: '#{' expression RCURL;
literalExpression: literal;

literal: BOOL_LITERAL | FLOATING_POINT_LITERAL | INTEGER_LITERAL | StringLiteral | NULL;
// mapData | listData | setData;
methodArguments: LPAREN expressionList? RPAREN;
expressionList: (expression ((COMMA expression)*));
lambdaExpressionOrCall: LPAREN lambdaExpression RPAREN methodArguments*;
lambdaExpression: lambdaParameters ARROW expression;
lambdaParameters: IDENTIFIER | (LPAREN (IDENTIFIER ((COMMA IDENTIFIER)*))? RPAREN);

mapEntry: expression COLON expression;
mapEntries: mapEntry (COMMA mapEntry)*;

expression
    : primary
    | '[' expressionList? ']'
    | '{' expressionList? '}'
    | '{' mapEntries? '}'
    | expression bop='.' (IDENTIFIER | IDENTIFIER '(' expressionList? ')')
    | expression ('[' expression ']')+
    | prefix=('-' | '!' | NOT1 | EMPTY) expression
    | expression bop=('*' | '/' | '%' | MOD1 | DIV1) expression
    | expression bop=('+' | '-') expression
    | expression bop=('<=' | '>=' | '>' | '<' | LE1 | GE1 | LT1 | GT1) expression
    | expression bop=INSTANCEOF IDENTIFIER
    | expression bop=('==' | '!=' | EQ1 | NE1) expression
    | expression bop=('&&' | AND1) expression
    | expression bop=('||' | OR1) expression
    | expression bop='?' expression bop=':' expression
    | expression
          bop=('=' | '+=' | '-=' | '*=' | '/=')
          expression
    | lambdaExpression
    | lambdaExpressionOrCall
    ;

primary
    : '(' expression ')'
    | literal
    | IDENTIFIER
    ;

BOOL_LITERAL: TRUE | FALSE;
IDENTIFIER: LETTER (LETTER|DIGIT)*;
INTEGER_LITERAL: [0-9]+;
FLOATING_POINT_LITERAL: [0-9]+ '.' [0-9]* EXPONENT? | '.' [0-9]+ EXPONENT? | [0-9]+ EXPONENT?;
fragment EXPONENT: ('e'|'E') ('+'|'-')? [0-9]+;


StringLiteral:                 ('"' DoubleStringCharacter* '"'
             |                  '\'' SingleStringCharacter* '\'') ;

fragment DoubleStringCharacter
    : ~["\\\r\n]
    | '\\' EscapeSequence
    ;

fragment SingleStringCharacter
    : ~['\\\r\n]
    | '\\' EscapeSequence
    ;
fragment EscapeSequence
    : CharacterEscapeSequence
    | '0'
    | HexEscapeSequence
    | UnicodeEscapeSequence
    | ExtendedUnicodeEscapeSequence
    ;
fragment CharacterEscapeSequence
    : SingleEscapeCharacter
    | NonEscapeCharacter
    ;
fragment HexEscapeSequence
    : 'x' HexDigit HexDigit
    ;

fragment UnicodeEscapeSequence
    : 'u' HexDigit HexDigit HexDigit HexDigit
    | 'u' '{' HexDigit HexDigit+ '}'
    ;
fragment ExtendedUnicodeEscapeSequence
    : 'u' '{' HexDigit+ '}'
    ;
fragment SingleEscapeCharacter
    : ['"\\bfnrtv]
    ;

fragment NonEscapeCharacter
    : ~['"\\bfnrtv0-9xu\r\n]
    ;
fragment EscapeCharacter
    : SingleEscapeCharacter
    | [0-9]
    | [xu]
    ;
fragment HexDigit
    : [_0-9a-fA-F]
    ;
fragment DecimalIntegerLiteral
    : '0'
    | [1-9] [0-9_]*
    ;
fragment ExponentPart
    : [eE] [+-]? [0-9_]+
    ;
fragment IdentifierPart
    : IdentifierStart
    | [\p{Mn}]
    | [\p{Nd}]
    | [\p{Pc}]
    | '\u200C'
    | '\u200D'
    ;
fragment IdentifierStart
    : [\p{L}]
    | [$_]
    | '\\' UnicodeEscapeSequence
    ;

LCURL: '{';
RCURL: '}';
LETTER:  '\u0024' |
                 '\u0041'..'\u005a' |
                 '\u005f' |
                 '\u0061'..'\u007a' |
                 '\u00c0'..'\u00d6' |
                 '\u00d8'..'\u00f6' |
                 '\u00f8'..'\u00ff' |
                 '\u0100'..'\u1fff' |
                 '\u3040'..'\u318f' |
                 '\u3300'..'\u337f' |
                 '\u3400'..'\u3d2d' |
                 '\u4e00'..'\u9fff' |
                 '\uf900'..'\ufaff';
DIGIT: '\u0030'..'\u0039'|
               '\u0660'..'\u0669'|
               '\u06f0'..'\u06f9'|
               '\u0966'..'\u096f'|
               '\u09e6'..'\u09ef'|
               '\u0a66'..'\u0a6f'|
               '\u0ae6'..'\u0aef'|
               '\u0b66'..'\u0b6f'|
               '\u0be7'..'\u0bef'|
               '\u0c66'..'\u0c6f'|
               '\u0ce6'..'\u0cef'|
               '\u0d66'..'\u0d6f'|
               '\u0e50'..'\u0e59'|
               '\u0ed0'..'\u0ed9'|
               '\u1040'..'\u1049';
TRUE: 'true';
FALSE: 'false';
NULL: 'null';
DOT: '.';
LPAREN: '(';
RPAREN: ')';
LBRACK: '[';
RBRACK: ']';
COLON: ':';
COMMA: ',';
SEMICOLON: ';';
GT0: '>';
GT1: 'gt';
LT0: '<';
LT1: 'lt';
GE0: '>=';
GE1: 'ge';
LE0: '<=';
LE1: 'le';
EQ0: '==';
EQ1: 'eq';
NE0: '!=';
NE1: 'ne';
NOT0: '!';
NOT1: 'not';
AND0: '&&';
AND1: 'and';
OR0: '||';
OR1: 'or';
EMPTY: 'empty';
INSTANCEOF: 'instanceof';
MULT: '*';
PLUS: '+';
MINUS: '-';
QUESTIONMARK: '?';
DIV0: '/';
DIV1: 'div';
MOD0: '%';
MOD1: 'mod';
CONCAT: '+=';
ASSIGN: '=';
ARROW: '->';
DOLLAR: '$';
HASH: '#';

WS: [ \t\r\n]+ -> skip;

Mike Cargal · Accepted Answer

Move the lexer rules for those keywords so that they come before the lexer rule for IDENTIFIER.

ANTLR's lexer prefers the longest match. If more than one lexer rule matches input of the same length, it chooses the rule that appears first in the grammar.

For example, "mod" is matched by both IDENTIFIER and MOD1, but IDENTIFIER comes first, so ANTLR chooses IDENTIFIER. Move the MOD1 rule before IDENTIFIER and "mod" will be matched as MOD1.
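As a rough illustration of the ordering, here is a cut-down grammar. The grammar name KeywordOrderDemo, the rule name expr, and the simplified ASCII-only IDENTIFIER are assumptions made for this sketch and are not part of the grammar in the question; only the relative position of the keyword rules and IDENTIFIER matters.

```antlr
grammar KeywordOrderDemo;   // hypothetical name, for illustration only

expr
    : expr bop=('%' | MOD1) expr
    | expr bop=('>' | GT1) expr
    | IDENTIFIER
    ;

// Symbol and keyword operators are declared first...
MOD0: '%';
MOD1: 'mod';
GT0: '>';
GT1: 'gt';

// ...so that IDENTIFIER loses the tie for inputs such as "mod" and "gt".
// Simplified to ASCII here; the question uses LETTER and DIGIT instead.
IDENTIFIER: [a-zA-Z_$] [a-zA-Z_$0-9]*;

WS: [ \t\r\n]+ -> skip;
```

With this ordering, "foo mod bar" should lex as IDENTIFIER MOD1 IDENTIFIER. A longer word such as "modulo" is still an IDENTIFIER, because the longest match wins before rule order is considered. In the question's grammar the same applies to every keyword rule declared after IDENTIFIER (for example NULL, EMPTY and INSTANCEOF), so those would need to move as well.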

———-

BTW, unless you care about having different token types for "%" and "mod", you can just define a single rule:

MOD: '%' | 'mod';

You can still get the token text if you need it, but you can then just specify MOD in your parser rules instead of (MOD0 | MOD1).
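A minimal sketch of the merged-token variant, under the same assumptions as before (the names ModDemo and expr and the simplified IDENTIFIER and INTEGER_LITERAL rules are made up for this example):

```antlr
grammar ModDemo;   // hypothetical name, for illustration only

expr
    : expr bop=(MULT | DIV | MOD) expr   // refer to the tokens by name
    | IDENTIFIER
    | INTEGER_LITERAL
    ;

// One token type per operator, covering both spellings.
// These still have to come before IDENTIFIER.
MOD: '%' | 'mod';
DIV: '/' | 'div';
MULT: '*';

IDENTIFIER: [a-zA-Z_$] [a-zA-Z_$0-9]*;
INTEGER_LITERAL: [0-9]+;

WS: [ \t\r\n]+ -> skip;
```

One thing to watch for: once MOD covers both spellings, writing the literal '%' inside a parser rule of a combined grammar no longer refers to MOD, since no lexer rule consists of exactly that literal, and ANTLR would create a separate implicit token for it. Using the token name MOD in the parser rules avoids that. In a listener or visitor, the label's text (for example ctx.bop.getText() in the Java target) tells you which spelling was used.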
