Lorenzo

Reputation: 690

ANTLR match identifier but not reserved keywords

I am trying to match complex numbers written in different notations. One of them uses the cis function, like this: MODULUS cis PHASE

The problem is that my identifier rule matches cis together with the start of the number that follows it. Since that match is longer than the CIS token itself, the lexer always returns an identifier token. How can I avoid that?

Here's the grammar :

grammar Sandbox;

input : number? CIS UNSIGNED 
    | IDENTIFIER
    ;

number : FLOAT
    | UFLOAT 
    | UINT
    | INT
    ;

fragment DIGIT : [0-9] ;

UFLOAT : UINT (DOT UINT? | 'f') ;
FLOAT : SUB UFLOAT ;
UINT : DIGITS ;
INT : SUB UINT ;
UNSIGNED : UFLOAT 
    | UINT 
    ;
DIGITS : DIGIT+ ;

// Specific lexer rules
CIS : 'cis' ;
SUB : '-' ; 
DOT : '.' ;
WS : [ \t]+ -> skip ;
NEWLINE : '\r'? '\n' ;

IDENTIFIER : [a-zA-Z_]+[a-zA-Z0-9_]* ;  // has to be after complex so i or cis doesn't match this first

Edit: The input I was trying to parse is the complex number 1+i, written with its modulus and phase like this: 1.4142135623730951cis0.7853981633974483

My actual problem is that the IDENTIFIER rule matches cis0 instead of the CIS lexer rule matching just cis, even though CIS is defined before IDENTIFIER.

I vaguely know that ANTLR chooses the rule with the longest match, but in this case I want to avoid that =o.

Upvotes: 3

Views: 850

Answers (2)

Mike Lischke

Reputation: 53345

I see two solutions here:

  1. Make the complex number a single lexer rule:
COMPLEX:  (FLOAT | UFLOAT | UINT | INT) WS* CIS WS* UNSIGNED;

which will be longer than an identifier or the pure CIS keyword (and hence matched first).

  2. A cis sequence is a keyword when it follows a digit (with optional whitespace between them), right? So you could do a lookback (LA(-1)) in a predicate to reject cis as an identifier if that condition is true.

I'd prefer solution 1, because the convention is that single entities (and a complex number is, like a float or a string, a single logical entity) are matched completely in a lexer rule, not in a parser rule.
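As a rough illustration of solution 1, here is a minimal sketch of a reduced grammar. The names ComplexSketch, UNSIGNED_NUMBER and SPACES are made up for this example and are not part of the original grammar. It also assumes the modulus is always present and leaves out the standalone number tokens:

```antlr
grammar ComplexSketch;   // hypothetical name; save as ComplexSketch.g4

// Parser: a complex literal now arrives as a single token.
input : (COMPLEX | IDENTIFIER) EOF ;

// The whole literal "MODULUS cis PHASE" is one lexer rule.
// Matching starts at the sign or first digit, so 'cis' is consumed
// inside COMPLEX and IDENTIFIER never gets the chance to match "cis0".
COMPLEX : '-'? UNSIGNED_NUMBER SPACES? 'cis' SPACES? UNSIGNED_NUMBER ;

IDENTIFIER : [a-zA-Z_] [a-zA-Z0-9_]* ;

WS : [ \t]+ -> skip ;

// Fragments (assumed helper names) only build other lexer rules;
// they never produce tokens of their own.
fragment UNSIGNED_NUMBER : DIGIT+ ('.' DIGIT* | 'f')? ;
fragment SPACES : [ \t]+ ;
fragment DIGIT : [0-9] ;
```

With this sketch, 1.4142135623730951cis0.7853981633974483 should come out as one COMPLEX token. The trade-off is that the modulus and phase are no longer separate tokens, so they have to be split out of the token text afterwards (for example in a visitor or listener).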

Upvotes: 3

Lorenzo

Reputation: 690

I'm just putting this here because I think it could be a potential solution, although I'd prefer not to use semantic predicates because they tie my grammar to a specific target language =/ (I have never used them before, so I'm not sure whether there are other caveats too):

IDENTIFIER: [a-zA-Z_][a-zA-Z0-9_]* { identifierIsNotReserved() }?;

Then we just need to implement the identifierIsNotReserved method to check whether the identifier rule consumed a reserved keyword and, if so, prevent the rule from being applied. To quote:

A semantic predicate is a block of arbitrary code in the target language surrounded by {...}?, which evaluates to a boolean value. If the returned value is false, the lexer rule is skipped.

Edit: I forgot to add the reference where I found this. Here it is: https://riptutorial.com/antlr/example/11237/actions-and-semantic-predicates

Upvotes: 0
