Martin Cup

Reputation: 2582

antlr grammar: Allow whitespace matching only in template string

I want to parse template strings:

`Some text ${variable.name} and so on ... ${otherVariable.function(parameter)} ...`

Here is my grammar:

varname: VAR ;
variable: varname funParameter? ('.' variable)* ;
templateString: '`' (TemplateStringLiteral* '${' variable '}' TemplateStringLiteral*)+ '`' ;
funParameter: '(' variable? (',' variable)*  ')' ;

WS      : [ \t\r\n\u000C]+ -> skip ;
TemplateStringLiteral: ('\\`' | ~'`') ;
VAR : [$]?[a-zA-Z0-9_]+|[$] ;

When the input is parsed, the template string no longer contains any whitespace because of the WS -> skip rule. When I put the TemplateStringLiteral rule before WS, I get the error:

extraneous input ' ' expecting {'`'}

How can I keep whitespace, instead of skipping it, only inside the template string?

Upvotes: 1

Views: 551

Answers (1)

AplusKminus

Reputation: 1642

What is currently happening

Running your example through your current grammar and printing the generated tokens gives this:

[@0,0:0='`',<'`'>,1:0]
[@1,1:4='Some',<VAR>,1:1]
[@2,6:9='text',<VAR>,1:6]
[@3,11:12='${',<'${'>,1:11]
[@4,13:20='variable',<VAR>,1:13]
[@5,21:21='.',<'.'>,1:21]
[@6,22:25='name',<VAR>,1:22]
[@7,26:26='}',<'}'>,1:26]
... shortened ...
[@26,85:84='<EOF>',<EOF>,2:0]

This tells you that Some, which you intended to be matched by TemplateStringLiteral*, was actually lexed as VAR. Why is this happening?

As mentioned in this answer, ANTLR uses the longest possible match to create a token. Since your TemplateStringLiteral rule only matches a single character, while your VAR rule matches arbitrarily many, the lexer uses the latter to match Some.
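
To see the longest-match rule in isolation, here is a minimal sketch of a lexer grammar. The grammar and rule names (LongestMatchDemo, AnyChar, Word) are made up for illustration and are not part of your grammar:

```antlr
lexer grammar LongestMatchDemo;

// Matches exactly one character, like your TemplateStringLiteral.
AnyChar : . ;

// Matches arbitrarily many characters, like your VAR.
Word : [a-zA-Z]+ ;
```

For the input Some, Word can consume four characters while AnyChar can consume only one, so the lexer emits a single Word token even though AnyChar is defined first. The order of the rules only decides ties between matches of equal length.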

What you could try (Spoiler: won't work)

You could try to modify the rule like this:

TemplateStringLiteral: ('\\`' | ~'`')+ ;

so that it captures more than one character and is therefore preferred. This does not work, for two reasons:

  1. How would the lexer match anything to the VAR rule, ever?

  2. The TemplateStringLiteral rule now also matches ${, which prevents the start of a template chunk from being recognized.

How to achieve what you actually want

There might be another solution, but this one works:

File MartinCup.g4:

parser grammar MartinCup;

options { tokenVocab=MartinCupLexer; }

templateString
    : BackTick TemplateStringLiteral* (template TemplateStringLiteral*)+ BackTick
    ;

template
    : TemplateStart variable TemplateEnd
    ;

variable
    : varname funParameter? (Dot variable)*
    ;

varname
    : VAR
    ;

funParameter
    : OpenPar variable? (Comma variable)* ClosedPar
    ;

File MartinCupLexer.g4:

lexer grammar MartinCupLexer;

BackTick : '`' ;

TemplateStart
    : '${' -> pushMode(templateMode)
    ;

TemplateStringLiteral
    : '\\`'
    | ~'`'
    ;

mode templateMode;

VAR
    : [$]?[a-zA-Z0-9_]+
    | [$]
    ;

OpenPar : '(' ;
ClosedPar : ')' ;
Comma : ',' ;
Dot : '.' ;

TemplateEnd
    : '}' -> popMode
    ;

This grammar uses lexer modes to differentiate between the inside and the outside of the curly braces. The VAR rule is now only active after ${ has been encountered and stays active only until } is read. It therefore no longer catches non-template text like Some.

Note that the use of lexer modes requires a split grammar (separate files for the parser and lexer grammars). Since a parser grammar can neither contain lexer rules nor define tokens implicitly through string literals, I had to introduce named tokens for the parentheses, comma, dot and backtick.

About the whitespaces

I assume you want to keep whitespace inside the "normal text", but not allow whitespace inside the templates. Therefore I simply removed the WS rule. You can always re-add it if you like.
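
If you do want to tolerate whitespace inside ${...} (for example after the commas of a parameter list), one option is to re-add the rule in templateMode only. This is a sketch that assumes the lexer grammar above; the rule name TemplateWS is my own choice:

```antlr
// Goes in MartinCupLexer.g4, somewhere after the line `mode templateMode;`.
// TemplateWS is a hypothetical name. The rule only exists in templateMode,
// so whitespace in the literal text is still lexed as TemplateStringLiteral.
TemplateWS
    : [ \t\r\n\u000C]+ -> skip
    ;
```

Because the rule is skipped in the lexer, the parser grammar does not need to change.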

I tested your alternative grammar, where you put TemplateStringLiteral above WS, but contrary to your observation, this gives me:

line 1:1 extraneous input 'Some' expecting {'${', TemplateStringLiteral}

The reason is the same as above: Some is lexed as VAR.

Upvotes: 2
