cgjohannsen

Reputation: 13

Antlr4 mismatched input '<' expecting '<' with (seemingly) no lexer ambiguity

I can't figure out what ANTLR is doing with this grammar. It should match input like:

i,j : bool;
setvar : set<bool>;
i > 5;
j < 10;

But I keep getting the error "line 3:13 mismatched input '<' expecting '<'". That suggests some ambiguity in the lexer, but I only use '<' in a single token.

Here is the grammar:

//// Parser Rules

grammar MLTL1;

start: block*;

block: var_list ';'
     | expr ';'
     ;

var_list: IDENTIFIER (',' IDENTIFIER)* ':' type ;

type: BASE_TYPE
    | KW_SET REL_LT BASE_TYPE REL_GT
    ;

expr: expr REL_OP expr
    | '(' expr ')'
    | IDENTIFIER 
    | INT
    ;

//// Lexical Spec

// Types
BASE_TYPE: 'bool'
         | 'int'
         | 'float'
         ;

// Keywords
KW_SET: 'set' ;

// Op groups for precedence
REL_OP: REL_EQ | REL_NEQ | REL_GT | REL_LT
      | REL_GTE | REL_LTE  ;

// Relational ops
REL_EQ: '==' ;
REL_NEQ: '!=' ;
REL_GT: '>' ;
REL_LT: '<' ;
REL_GTE: '>=' ;
REL_LTE: '<=' ; 

IDENTIFIER
  : LETTER (LETTER | DIGIT)*
  ;

INT
  : SIGN? NONZERODIGIT DIGIT*
  | '0'
  ;

fragment
SIGN
  : [+-]
  ;

fragment
DIGIT
  :  [0-9]
  ;

fragment
NONZERODIGIT
  : [1-9]
  ;

fragment
LETTER
  : [a-zA-Z_]
  ;

COMMENT : '#' ~[\r\n]* -> skip;
WS  :  [ \t\r\n]+ -> channel(HIDDEN);

To see which tokens the lexer generates for the test input above, I ran this Python script:

from antlr4 import InputStream, CommonTokenStream

import MLTL1Lexer
import MLTL1Parser

input="""
  i,j : bool;
  setvar: set<bool>;
  i > 5;
  j < 10;
"""

lexer = MLTL1Lexer.MLTL1Lexer(InputStream(input))
stream = CommonTokenStream(lexer)

stream.fill()
tokens = stream.getTokens(0,100)
for t in tokens:
  print(str(t.type) + " " + t.text)

parser = MLTL1Parser.MLTL1Parser(stream)
parse_tree = parser.start()
print(parse_tree.toStringTree(recog=parser))

I noticed that both '>' and '<' were assigned the same token type even though they are defined as two different tokens. Am I missing something here?

Upvotes: 1

Views: 379

Answers (1)

Mike Cargal

Reputation: 6785

(There may be more than just these two instances, but...)

Change REL_OP and BASE_TYPE to parser rules (i.e. make them lowercase).

As written, REL_OP and BASE_TYPE effectively turn many of your intended lexer rules into fragments.

It's important to understand that tokens are the "atoms" of your grammar. When you combine several of them into another lexer rule, that combined rule becomes the token type the lexer emits.

(If you had used grun to dump the tokens, you would have seen them identified as REL_OP tokens.)
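To make the effect concrete, here is a rough Python sketch that imitates how I understand the ANTLR lexer to choose a rule: the longest match wins, and on a tie the rule declared first wins. This is not ANTLR itself. The regexes are hand-written stand-ins for the rules in the question, and the names `ORIGINAL`, `FIXED`, `tokenize`, `PUNCT` and `LITERAL` are made up for the illustration. It also assumes that literals used in parser rules are treated as tokens declared ahead of the named lexer rules.

```python
import re

# Hand-written regex stand-ins for the question's lexer rules, in declaration order.
ORIGINAL = [
    ("PUNCT", r"[;,:()]"),  # stand-in for the implicit ';' ',' ':' '(' ')' tokens
    ("BASE_TYPE", r"bool|int|float"),
    ("KW_SET", r"set"),
    ("REL_OP", r"==|!=|>=|<=|>|<"),
    ("REL_EQ", r"=="), ("REL_NEQ", r"!="),
    ("REL_GT", r">"), ("REL_LT", r"<"),
    ("REL_GTE", r">="), ("REL_LTE", r"<="),
    ("IDENTIFIER", r"[a-zA-Z_][a-zA-Z_0-9]*"),
    ("INT", r"[+-]?[1-9][0-9]*|0"),
    ("WS", r"[ \t\r\n]+"),
]

# After the fix, REL_OP and BASE_TYPE are parser rules, so the lexer no longer
# has them; 'bool', 'int' and 'float' become literal tokens instead (assumption:
# such literals are declared ahead of the named lexer rules).
FIXED = [("LITERAL", r"bool|int|float|[;,:()]")] + [
    rule for rule in ORIGINAL if rule[0] not in ("PUNCT", "BASE_TYPE", "REL_OP")
]


def tokenize(text, rules):
    """Return (rule_name, text) pairs, skipping whitespace."""
    compiled = [(name, re.compile(pattern)) for name, pattern in rules]
    pos, tokens = 0, []
    while pos < len(text):
        best_name, best_len = None, 0
        for name, regex in compiled:
            match = regex.match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so on a tie the earlier rule is kept.
            if match and match.end() - pos > best_len:
                best_name, best_len = name, match.end() - pos
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        if best_name != "WS":
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


for label, rules in (("original", ORIGINAL), ("fixed", FIXED)):
    print(label, tokenize("set<bool>", rules))
```

With the original rule order, '<' and '>' both come out as REL_OP, so the parser never sees the REL_LT it is waiting for in the `type` rule. Once REL_OP is no longer a lexer rule, the same characters come out as REL_LT and REL_GT.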

With the changes below, your sample input works just fine.

grammar MLTL1;

start: block*;

block: var_list ';' | expr ';';

var_list: IDENTIFIER (',' IDENTIFIER)* ':' type;

type: baseType | KW_SET REL_LT baseType REL_GT;

expr: expr rel_op expr | '(' expr ')' | IDENTIFIER | INT;

//// Lexical Spec

// Types
baseType: 'bool' | 'int' | 'float';

// Keywords
KW_SET: 'set';

// Op groups for precedence
rel_op: REL_EQ | REL_NEQ | REL_GT | REL_LT | REL_GTE | REL_LTE;

// Relational ops
REL_EQ:  '==';
REL_NEQ: '!=';
REL_GT:  '>';
REL_LT:  '<';
REL_GTE: '>=';
REL_LTE: '<=';

IDENTIFIER: LETTER (LETTER | DIGIT)*;

INT: SIGN? NONZERODIGIT DIGIT* | '0';

fragment SIGN: [+-];

fragment DIGIT: [0-9];

fragment NONZERODIGIT: [1-9];

fragment LETTER: [a-zA-Z_];

COMMENT: '#' ~[\r\n]* -> skip;
WS:      [ \t\r\n]+   -> channel(HIDDEN);


Upvotes: 2
