Handle newlines explicitly only in part of an ANTLR grammar

Question

I'm currently developing a parser for an old proprietary markup-like language which has to be converted to a newer standard. I'm using ANTLR 4 for that.

The structure is composed of blocks delimited by a specific opener and its matching terminator (e.g. { ... }, < ... >, INPUT ... END). Inside each block, elements are specified in rows separated by newlines, but the newlines are significant only in some places.

For example:

< ID
SOME_VAR "optional modifier string"
$anEnvironmentVariable
"a constant string"
"another constant" "with its optional modifier"
>

A parser rule like the following

field
  : OPEN_ANGLED_BRACKET row_id
    ((ENVIRONMENT_VAR | DQUOTE_STR | VAR) DQUOTE_STR?)+
    CLOSED_ANGLED_BRACKET
  ;

// [...]

WHITESPACE
  : [ \t\r\n] -> skip
  ;

can easily parse the above example, but because newlines are skipped, it can't tell whether a double-quoted string is a constant (meaning it's at the start of the line) or a modifier string (which follows a variable or constant on the same line).

I could actually explicitly handle the newline like this:

field
  : OPEN_ANGLED_BRACKET row_id NEWLINE
    ((ENVIRONMENT_VAR | DQUOTE_STR | VAR) DQUOTE_STR? NEWLINE)+
    CLOSED_ANGLED_BRACKET NEWLINE
  ;

// [...]

WHITESPACE
  : [ \t] -> skip
  ;

NEWLINE
  : '\r'? '\n'
  | '\r'
  ;

but then I must explicitly handle newline everywhere in the rest of the grammar, complicating it by a lot!

Is there any way to keep explicit newline handling confined to the angle brackets, skipping newlines "automatically" everywhere else?

Bart Kiers · Accepted Answer

You could use lexical modes here. To use lexical modes, you have to define separate lexer and parser grammars.

Whenever the lexer encounters an ENVIRONMENT_VAR, VAR or DQUOTE_STR (the first value in a row), it switches to a second lexical mode. In that mode you match three things: strings, spaces (which you skip) and newlines (which you also skip, switching back to the default mode afterwards). This all might sound a bit vague, so here's a short demo of it all:

File: MarkupLexer.g4

lexer grammar MarkupLexer;

ENVIRONMENT_VAR       : '$' VAR    -> mode(MODIFIER_MODE);
VAR                   : [a-zA-Z_]+ -> mode(MODIFIER_MODE);
DQUOTE_STR            : STR        -> mode(MODIFIER_MODE);
OPEN_ANGLED_BRACKET   : '<';
CLOSED_ANGLED_BRACKET : '>';
SPACES                : [ \t\r\n] -> skip;

fragment STR : '"' ~["\r\n]* '"';

mode MODIFIER_MODE;

  MODIFIER_MODE_SPACES : [ \t] -> skip;
  MODIFIER_MODE_NL     : [\r\n]+ -> skip, mode(DEFAULT_MODE);
  MODIFIER_MODE_STRING : STR;
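
If the mode switching is hard to picture, it can be emulated outside ANTLR. The following is only a rough Python sketch of the idea, not what ANTLR generates: the names RULES and tokenize are made up for this illustration, and it simply takes the first rule that matches instead of applying ANTLR's longest-match logic (which makes no difference for these particular rules).

```python
import re

# Token patterns per mode, mirroring the lexer rules above.
# Each entry: (token name, regex, skip?, mode to switch to or None).
# RULES and tokenize are hypothetical names used only for this sketch.
STR = r'"[^"\r\n]*"'
RULES = {
    "DEFAULT_MODE": [
        ("ENVIRONMENT_VAR", r"\$[a-zA-Z_]+", False, "MODIFIER_MODE"),
        ("VAR", r"[a-zA-Z_]+", False, "MODIFIER_MODE"),
        ("DQUOTE_STR", STR, False, "MODIFIER_MODE"),
        ("OPEN_ANGLED_BRACKET", r"<", False, None),
        ("CLOSED_ANGLED_BRACKET", r">", False, None),
        ("SPACES", r"[ \t\r\n]", True, None),
    ],
    "MODIFIER_MODE": [
        ("MODIFIER_MODE_SPACES", r"[ \t]", True, None),
        ("MODIFIER_MODE_NL", r"[\r\n]+", True, "DEFAULT_MODE"),
        ("MODIFIER_MODE_STRING", STR, False, None),
    ],
}


def tokenize(text):
    """Return (token name, text) pairs, switching modes like the lexer grammar."""
    mode, pos, tokens = "DEFAULT_MODE", 0, []
    while pos < len(text):
        # Try the rules of the current mode in order; the first match wins.
        for name, pattern, skip, next_mode in RULES[mode]:
            m = re.compile(pattern).match(text, pos)
            if m:
                break
        else:
            raise ValueError(f"no rule in {mode} matches at offset {pos}")
        if not skip:
            tokens.append((name, m.group()))
        if next_mode:
            mode = next_mode  # e.g. a newline in MODIFIER_MODE returns to DEFAULT_MODE
        pos = m.end()
    return tokens
```

Calling tokenize on a row such as `"another constant" "with its optional modifier"` yields a DQUOTE_STR followed by a MODIFIER_MODE_STRING: the same kind of text gets a different token type depending on where it sits on the line, and that is the information the parser needs.
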

The parser will look like this:

File: MarkupParser.g4

parser grammar MarkupParser;

options {
  tokenVocab=MarkupLexer;
}

field
  : OPEN_ANGLED_BRACKET row_id row+ CLOSED_ANGLED_BRACKET
  ;

row
 : (ENVIRONMENT_VAR | DQUOTE_STR | VAR) MODIFIER_MODE_STRING?
 ;

row_id
 : VAR
 ;

When you parse the input:

< ID
SOME_VAR "optional modifier string"
$anEnvironmentVariable
"a constant string"
"another constant" "with its optional modifier"
>

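you will get a parse tree with one row node per input line: a string at the start of a line is tokenized as DQUOTE_STR, while a string that follows another value on the same line is tokenized as MODIFIER_MODE_STRING.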
