Trygve Laugstøl

Reputation: 7726

Solving ambiguous input: mismatched input

I have this grammar:

grammar MkSh;

script
  : (statement
    | targetRule
    )*
  ;

statement
  :  assignment
  ;

assignment
  :  ID '=' STRING
  ;

targetRule
  : TARGET ':' TARGET*
  ;

ID
  :  ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
  ;

WS
  : ( ' '
    | '\t'
    | '\r'
    | '\n'
    ) -> channel(HIDDEN)
  ;

STRING
  : '\"' CHR* '\"'
  ;

fragment
CHR
  : ('a'..'z'|'A'..'Z'|' ')
  ;

TARGET
  :  ('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'-'|'/'|'.')+
  ;

and this input file:

hello="world"

target: CLASSES

When running my parser I'm getting this error:

line 3:6 mismatched input ':' expecting '='
line 3:15 mismatched input '<EOF>' expecting '='

This is because the parser is taking "target" as an ID instead of a TARGET. I want the parser to choose the rule based on the separator character (':' vs '=').

How can I get that to happen?

(This is my first ANTLR project, so I'm open to anything.)

Upvotes: 1

Views: 771

Answers (2)

Ron Burk

Reputation: 6231

As @cantSleepNow alludes to, you've defined a token (TARGET) that is a lexical superset of another token (ID), and then, by listing ID first, told the lexer to tokenize a string as TARGET only if it cannot be tokenized equally well as ID. This is all made more obscure by the fact that ANTLR lexing rules look like ANTLR parsing rules, though they are really quite different beasts.
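
A small sketch of how the two overlapping rules from the question compete in the lexer. ANTLR picks the rule that matches the most characters and, on a tie, the rule listed first. The grammar name Overlap and the sample inputs in the comments are made up for illustration.

```antlr
lexer grammar Overlap;

ID
  : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
  ;

TARGET
  : ('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'-'|'/'|'.')+
  ;

// target      both rules match all 6 characters; ID is listed first, so ID wins
// src/main.o  ID can match only "src", TARGET matches all 10 characters, so TARGET wins
// 9lives      ID cannot start with a digit, so only TARGET matches
```

Swapping the order of the two rules would not help on its own: TARGET would then win every tie, and no name would ever be tokenized as ID.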

(Warning: writing off the top of my head without testing :-)

Your real project might be more complex, but in the possibly simplified example you posted, you could defer distinguishing the two to the parsing phase, instead of distinguishing them in the lexer:

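// Assumes the ID lexer rule is removed, so every name is lexed as TARGET.
// The action is Java: complain if not a legal identifier (e.g., contains slashes).
id : TARGET
    { if (!$TARGET.text.matches("[A-Za-z_][A-Za-z0-9_]*"))
        notifyErrorListeners($TARGET, "not a legal identifier: " + $TARGET.text, null); }
    ;
assignment
  :  id '=' STRING
  ;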

That seems like it would solve the lexing issue, and it lets you give a more intelligent error message than "syntax error" when a user gets the syntax for an ID wrong. The grammar remains ambiguous (with whitespace hidden, a name that follows a target rule could be one more dependency or the start of the next statement), but maybe ANTLR roulette will happen to make the choice you prefer in the ambiguous case. Of course, unambiguous grammars tend to make for languages that humans find more readable, and now you can see why the classic makefile syntax requires a newline after an assignment or target rule.
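
Putting the single-token idea and the newline idea together, one possible shape for the whole grammar is sketched here. It is just as untested as the rest of this answer, and it makes assumptions that are not in the question: a single TARGET token replaces ID, a new NL token (a name invented here) makes line breaks significant, and WS no longer swallows '\r' and '\n'.

```antlr
grammar MkSh;

// Each statement ends at a newline; the last line may omit it.
script
  : (statement? NL)* statement? EOF
  ;

statement
  : assignment
  | targetRule
  ;

// Both rules start with TARGET; the separator that follows picks the rule.
assignment
  : TARGET '=' STRING
  ;

// The dependency list stops at the first NL, because NL is not a TARGET.
targetRule
  : TARGET ':' TARGET*
  ;

STRING
  : '"' ('a'..'z'|'A'..'Z'|' ')* '"'
  ;

TARGET
  : ('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'-'|'/'|'.')+
  ;

// Newlines are real tokens now instead of hidden whitespace.
NL
  : '\r'? '\n'
  ;

WS
  : (' '|'\t')+ -> channel(HIDDEN)
  ;
```

The legal-identifier check on the left side of an assignment is left out here to keep the sketch short; it could be layered on top in a parser action or in a later pass over the parse tree.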

Upvotes: 1

cantSleepNow

Reputation: 10202

First, you need to know that the word target is matched as an ID token and not as a TARGET token: since you have written the ID rule before the TARGET rule, the lexer will always recognize it as ID. Notice that the word target fully matches both the ID and TARGET lexer rules, which means (I'm going to suppose that you are writing a language) that target, which is a keyword, can also be used as an id. The book "The Definitive ANTLR Reference" has a section titled "Treating Keywords As Identifiers" that deals with exactly this kind of issue; I suggest you take a look at it. Or, if you prefer the quick answer, the solution is to use lexer modes. It would also be better to split the grammar into separate parser and lexer grammars.
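
The general idea behind treating keywords as identifiers is to let a parser rule accept every token type that is valid at a given spot. A minimal sketch of that idea applied to the grammar in the question, leaving all lexer rules as posted; the parser rule name target is introduced here for illustration and is not in the original grammar:

```antlr
// Replaces the original targetRule; the lexer rules stay as posted.
targetRule
  : target ':' target*
  ;

// "target" and "CLASSES" are lexed as ID, while something like
// "src/main.o" is lexed as TARGET, so accept either token type.
target
  : ID
  | TARGET
  ;
```

With this change both assignment and targetRule can start with an ID, and the parser tells them apart by the token that follows ('=' or ':'). This sketch does not use lexer modes; those are only available in a separate lexer grammar, which is one reason to split the grammar first.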

Upvotes: 1
