Reputation: 2582
I want to parse template strings:
`Some text ${variable.name} and so on ... ${otherVariable.function(parameter)} ...`
Here is my grammar:
varname: VAR ;
variable: varname funParameter? ('.' variable)* ;
templateString: '`' (TemplateStringLiteral* '${' variable '}' TemplateStringLiteral*)+ '`' ;
funParameter: '(' variable? (',' variable)* ')' ;
WS : [ \t\r\n\u000C]+ -> skip ;
TemplateStringLiteral: ('\\`' | ~'`') ;
VAR : [$]?[a-zA-Z0-9_]+|[$] ;
When the input for the grammar is parsed, the template string has no whitespaces anymore because of the WS -> skip. When I put the TemplateStringLiteral before WS, I get the error:
extraneous input ' ' expecting {'`'}
How can I allow whitespaces to be parsed and not skipped only inside the template string?
Upvotes: 1
Views: 551
Reputation: 1642
When testing your example against your current grammar displaying the generated tokens, the lexer gives this:
[@0,0:0='`',<'`'>,1:0]
[@1,1:4='Some',<VAR>,1:1]
[@2,6:9='text',<VAR>,1:6]
[@3,11:12='${',<'${'>,1:11]
[@4,13:20='variable',<VAR>,1:13]
[@5,21:21='.',<'.'>,1:21]
[@6,22:25='name',<VAR>,1:22]
[@7,26:26='}',<'}'>,1:26]
... shortened ...
[@26,85:84='<EOF>',<EOF>,2:0]
This tells you, that Some
which you intended to be TemplateStringLiteral*
was actually lexed to be VAR
. Why is this happening?
As mentioned in this answer, antlr uses the longest possible match to create a token. Since your TemplateStringLiteral
rule only matches single characters, but your VAR
rule matches infinitely many, the lexer obviously uses the latter to match Some
.
You could try to modify the rule like this:
TemplateStringLiteral: ('\\`' | ~'`')+ ;
so that it captures more than one character and therefore will be preferred. This has two reasons why it does not work:
How would the lexer match anything to the VAR
rule, ever?
The TemplateStringLiteral
rule now also matches ${
therefore prohibiting the correct recognition of the start of a template chunk.
There might be another solution, but this one works:
File MartinCup.g4:
parser grammar MartinCup;
options { tokenVocab=MartinCupLexer; }
templateString
: BackTick TemplateStringLiteral* (template TemplateStringLiteral*)+ BackTick
;
template
: TemplateStart variable TemplateEnd
;
variable
: varname funParameter? (Dot variable)*
;
varname
: VAR
;
funParameter
: OpenPar variable? (Comma variable)* ClosedPar
;
File MartinCupLexer.g4:
lexer grammar MartinCupLexer;
BackTick : '`' ;
TemplateStart
: '${' -> pushMode(templateMode)
;
TemplateStringLiteral
: '\\`'
| ~'`'
;
mode templateMode;
VAR
: [$]?[a-zA-Z0-9_]+
| [$]
;
OpenPar : '(' ;
ClosedPar : ')' ;
Comma : ',' ;
Dot : '.' ;
TemplateEnd
: '}' -> popMode;
This grammar uses lexer modes to differentiate between the inside and the outside of the curly braces. The VAR
rule is now only active after ${
has been encountered and only stays active until }
is read. It thereby does not catch non-template text like Some
.
Notice that the use of lexer modes requires a split grammar (separate files for parser and lexer grammars). Since no lexer rules are allowed in a parser grammar, I had to introduce tokens for the parentheses, comma, dot and backticks.
I assume you want to keep whitespaces inside the "normal text", but not allow whitespace inside the templates. Therefore I simply removed the WS
rule. You can always re-add it if you like.
I tested your alternative grammar, where you put TemplateStringLiteral
above WS
, but contrary to your observation, this gives me:
line 1:1 extraneous input 'Some' expecting {'${', TemplateStringLiteral}
The reason for this is the same as above, Some
is lexed to VAR
.
Upvotes: 2