Martin Cup

Reputation: 2582

antlr grammar: Lexer matches "impossible" rule

I have this parser grammar, in which I also want to support something similar to JavaScript template strings.

parser grammar Test;

options {
  tokenVocab = TestLexer;
}

definition: sourceElements? EOF ;

sourceElements: sourceElement+ ;

sourceElement: mapping ;


templateString: '`' TemplateStringCharacter* ('${' variable '}' TemplateStringCharacter*)+ '`' ;
fieldName: varname | ('[' value ']') ;
mapping: fieldName ':' ( '{' sourceElements '}'
      | variable ( '{' sourceElements '}' )? '?'?
      | value
      | array )
      ;

funParameter: '(' value? (',' value)*  ')' ;
array: '[' value? (',' value)* ']';
variable: (varname | '{' value '}' | '[' boolEx ']' | templateString) funParameter? ('.' variable)* ;
value: INT | BOOL | FLOAT | STRING | variable ;
varname: VAR ;

And this lexer grammar:

lexer grammar TestLexer;

WS : [ \t\r\n\u000C]+ -> skip ;
NEWLINE : [\r\n] ;
BOOL : ('true'|'false') ;
TemplateStringLiteral : TemplateStringCharacter*;
VAR : [$]?[a-zA-Z0-9_]+|[@] ;
INT : '-'?[0-9]+ ;
FLOAT : '-'?[0-9]+'.'[0-9]+ ;
STRING : '"' DoubleStringCharacter* '"' | '\'' SingleStringCharacter* '\'' ;
TEMPSTART : '${' ;
TEMPEND : '}' ;

TemplateStart : '`' -> pushMode(template) ;

/// Comments
MultiLineComment : '/*' .*? '*/' -> channel(HIDDEN) ;
SingleLineComment : '//' ~[\r\n\u2028\u2029]* -> channel(HIDDEN) ;

mode template;
TemplateVariableStart: TEMPSTART -> pushMode(templateVariable);
TemplateStringLiteral : TemplateStringCharacter* ;
TemplateEnd : '`' -> popMode;

mode templateVariable;
WS : [ \t\r\n\u000C]+ -> skip ;
All : [^}]+ ;
TemplateVariableEnd : TEMPEND -> popMode;

fragment DoubleStringCharacter : ~["\r\n] ;
fragment SingleStringCharacter : ~['\r\n] ;
fragment TemplateStringCharacter : ~[`] ;
fragment DecimalDigit : [0-9] ;

When I input this:

test: {
  abc: `Hello World`
}

The parse tree looks like this:

(definition 
  (sourceElements 
    (sourceElement 
      (statement 
        (mapping 
          (fieldName 
            (varname test)
          ) : { 
          (sourceElements
            (sourceElement
              (statement mapping)
            ) 
            (sourceElement
              (statement
                (mapping abc : `)
              )
            ) 
            (sourceElement 
              (statement mapping)
            ) 
            (sourceElement 
              (statement 
                (mapping Hello)
              )
            ) 
            (sourceElement 
              (statement
                (mapping World `)
              )
            )
          ) 
          }
        )
      )
    )
  ) 
  <EOF>
)

And I get the error: line 2:8 no viable alternative at input 'abc:`Hello'

I don't understand, why it is even possible to match something like an empty mapping or a mapping like "World `" because a mapping would need to have a ":" in the middle. And why doesn't the templateString rule match the whole "Hello World" from backtick to backtick?

EDIT:

After noticing that the lexer wasn't being regenerated when I thought it was, I got errors like: "cannot create implicit token for string literal in non-combined grammar: ']'". So I had to move all implicit token declarations to the lexer grammar, and I changed the code to this:

parser grammar Test;

options {
  tokenVocab = TestLexer;
}

definition: sourceElements? EOF ;

sourceElements: sourceElement+ ;

sourceElement: mapping ;

templateString: OpenBackTick TemplateStringLiteral* (TemplateVariableStart variable CloseBrace TemplateStringLiteral*)+ CloseBackTick ;
fieldName: varname | OpenBracket value CloseBracket ;
mapping: fieldName Colon (
      OpenBrace sourceElements CloseBrace
      | variable ( OpenBrace sourceElements CloseBrace )? IF?
      | value
      | array
    )
    ;

funParameter: OpenParen value? (Comma value)* CloseParen ;
array: OpenBracket value? (Comma value)* CloseBracket;
variable: (varname | OpenBrace value CloseBrace | templateString) funParameter? (Dot variable)* ;
value: INT | BOOL | FLOAT | STRING | variable ;
varname: VAR ;

And lexer grammar:

lexer grammar TestLexer;

OpenBracket: '[';
CloseBracket: ']';
OpenParen: '(';
CloseParen: ')';
OpenBrace: '{' ;
CloseBrace: '}' ;
IF: '?' ;
AND: 'AND' ;
OR: 'OR';
LessThan: '<';
MoreThan: '>';
LessThanEquals:   '<=';
GreaterThanEquals:   '>=';
Equals: '=';
NotEquals: '!=';
IN: 'IN';
NOT: '!';
Colon: ':';
Dot: '.' ;
Comma: ',' ;
OpenBackTick : '`' -> pushMode(template) ;

WS : [ \t\r\n\u000C]+ -> skip ;
NEWLINE : [\r\n] ;
BOOL : ('true'|'false') ;
VAR : [$]?[a-zA-Z0-9_]+|[@] ;
INT : '-'?[0-9]+ ;
FLOAT : '-'?[0-9]+'.'[0-9]+ ;
STRING : '"' DoubleStringCharacter* '"' | '\'' SingleStringCharacter* '\'' ;

/// Comments
MultiLineComment : '/*' .*? '*/' -> channel(HIDDEN) ;
SingleLineComment : '//' ~[\r\n\u2028\u2029]* -> channel(HIDDEN) ;

mode template;
TemplateVariableStart: '${' -> pushMode(templateVariable);
CloseBackTick : '`' -> popMode;
TemplateStringLiteral: TemplateStringCharacter ;

mode templateVariable;
WHS : [ \t\r\n\u000C]+ -> skip ;
All : [^}]+ ;
TemplateVariableEnd : CloseBrace -> popMode;

fragment DoubleStringCharacter : ~["\r\n] ;
fragment SingleStringCharacter : ~['\r\n] ;
fragment TemplateStringCharacter : ~[`] ;
fragment DecimalDigit : [0-9] ;

Now I get the error: line 1:0 mismatched input 'test' expecting {, '?', '[', VAR}. This is strange, because 'test' should be matched by VAR. Any ideas why this is happening?

Upvotes: 0

Views: 425

Answers (1)

sepp2k

Reputation: 370327

There are two lexer rules in your default mode that can match a backtick: BTICK and TemplateStart. TemplateStart will switch to the template mode, but BTICK will not. Both rules match the same single character, so neither wins by length, and the tie is broken by order: since BTICK comes first in your grammar, it takes precedence. That means when the lexer sees a backtick, it will generate a BTICK token and not switch modes.

To fix this you should have only one lexer rule per mode that matches a backtick and that rule should change the mode.
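
As a rough illustration of that advice, here is a minimal lexer sketch, not a drop-in replacement for the grammar in the question. The token names OpenBackTick, CloseBackTick, TemplateVariableStart, TemplateVariableEnd, TemplateStringLiteral and VAR are borrowed from the edited question. The grammar name TemplateSketchLexer, the helper rules TemplateVAR and TemplateWS, and the decision to allow only a plain variable name inside `${...}` are assumptions made for the example.

```antlr
lexer grammar TemplateSketchLexer;   // hypothetical name, for illustration only

// Default mode: exactly ONE rule matches a backtick, and that rule
// switches to the template mode.
OpenBackTick : '`' -> pushMode(template) ;
OpenBrace    : '{' ;
CloseBrace   : '}' ;
Colon        : ':' ;
VAR          : [$]? [a-zA-Z0-9_]+ ;
WS           : [ \t\r\n]+ -> skip ;

mode template;
// '${' is two characters long, so it beats the one-character '$'
// alternative of TemplateStringLiteral (longest match wins).
TemplateVariableStart : '${' -> pushMode(templateVariable) ;
// Again the only rule in this mode that matches a backtick.
CloseBackTick         : '`' -> popMode ;
// A run of ordinary text, or a lone '$' that does not start '${'.
TemplateStringLiteral : ~[`$]+ | '$' ;

mode templateVariable;
TemplateVariableEnd : '}' -> popMode ;
// Assumption: only a simple variable name is allowed inside ${...};
// it is reported with the same token type as VAR in the default mode.
TemplateVAR         : [$]? [a-zA-Z0-9_]+ -> type(VAR) ;
TemplateWS          : [ \t\r\n]+ -> skip ;
```

With a lexer along these lines, the input `Hello World` (including its backticks) should produce OpenBackTick, one TemplateStringLiteral and CloseBackTick. A matching parser rule could then look something like templateString: OpenBackTick (TemplateStringLiteral | TemplateVariableStart VAR TemplateVariableEnd)* CloseBackTick ; where the shape of that rule is again only a suggestion.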

I don't understand, why it is even possible to match something like an empty mapping or a mapping like "World `" because a mapping would need to have a ":" in the middle.

When your input contains a syntax error, the parser tries to recover and keep going, so the generated parse tree can contain constructs that aren't actually valid according to your grammar. When your input parses without errors, you'll get a tree that makes sense.

Upvotes: 1
