NiccoMlt

Reputation: 305

Handle newlines explicitly only in part of an ANTLR grammar

I'm currently developing a parser for an old proprietary markup-like language which has to be converted to a newer standard. I'm using ANTLR 4 for that.

The structure is composed of blocks delimited by a specific opener and its matching terminator (e.g. { ... }, < ... >, INPUT ... END). Inside each block, elements are listed in rows separated by newlines, but only in some places are these newlines actually needed to understand what the code means.

For example:

< ID
SOME_VAR "optional modifier string"
$anEnvironmentVariable
"a constant string"
"another constant" "with its optional modifier"
>

A parser rule like the following

field
  : OPEN_ANGLED_BRACKET row_id
    ((ENVIRONMENT_VAR | DQUOTE_STR | VAR) DQUOTE_STR?)+
    CLOSED_ANGLED_BRACKET
  ;

// [...]

WHITESPACE
  : [ \t\r\n] -> skip
  ;

can easily parse the above example, but because newlines are ignored, it can't tell whether a double-quoted string is a constant (meaning it's at the start of the line) or a modifier string (which follows a variable or constant on the same line).
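
To make the ambiguity concrete, here is a minimal Python sketch (not ANTLR itself). The regex and the helper names lex_skipping_newlines and group_greedily are made up for illustration, and the greedy handling of the optional string is an assumption about how DQUOTE_STR? would typically resolve. Once newlines are dropped, the last two rows of the example collapse into three consecutive strings, and the grouping no longer matches the lines:

```python
import re

# Hypothetical stand-ins for the lexer rules: strings, $env vars, plain vars.
TOKEN = re.compile(r'"[^"\r\n]*"|\$[A-Za-z_]+|[A-Za-z_]+')


def lex_skipping_newlines(text):
    """Return token texts; all whitespace, including newlines, is dropped."""
    return TOKEN.findall(text)


def group_greedily(tokens):
    """Mimic ((ENVIRONMENT_VAR | DQUOTE_STR | VAR) DQUOTE_STR?)+ with a
    greedy optional: a string right after a value is taken as its modifier."""
    rows, i = [], 0
    while i < len(tokens):
        value, modifier = tokens[i], None
        i += 1
        if i < len(tokens) and tokens[i].startswith('"'):
            modifier = tokens[i]
            i += 1
        rows.append((value, modifier))
    return rows


source = '"a constant string"\n"another constant" "with its optional modifier"\n'
rows = group_greedily(lex_skipping_newlines(source))
for value, modifier in rows:
    print(value, "->", modifier)
```

The second constant ends up as the modifier of the first one, and the real modifier becomes a row of its own, which is exactly the information the skipped newline carried.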

I could actually explicitly handle the newline like this:

field
  : OPEN_ANGLED_BRACKET row_id NEWLINE
    ((ENVIRONMENT_VAR | DQUOTE_STR | VAR) DQUOTE_STR? NEWLINE)+
    CLOSED_ANGLED_BRACKET NEWLINE
  ;

// [...]

WHITESPACE
  : [ \t] -> skip
  ;

NEWLINE
  : '\r'? '\n'
  | '\r'
  ;

but then I must explicitly handle newlines everywhere else in the grammar, which complicates it considerably!

Is there any way to keep explicit newline handling confined to the angle brackets, skipping newlines "automatically" everywhere else?

Upvotes: 4

Views: 1560

Answers (2)

Mike Lischke

Reputation: 53337

I wanted to come up with a solution that doesn't use lexer modes (as I find them ugly) and therefore modified Bart's grammar:

grammar SOGrammar;

start:
    OPEN_ANGLED_BRACKET ROW_ID ROW* CLOSED_ANGLED_BRACKET
;

VAR:             [a-zA-Z_]+;
ENVIRONMENT_VAR: '$' VAR;
DQUOTE_STR:      '"' .*? '"';

OPEN_ANGLED_BRACKET: '<';
CLOSED_ANGLED_BRACKET: '>';

ROW_ID: VAR LINEBREAK;
ROW:
    (ENVIRONMENT_VAR | DQUOTE_STR | VAR) (SPACE* DQUOTE_STR)? LINEBREAK
;

SPACE: [ \t] -> skip;
LINEBREAK: [\r\n] -> skip;

The idea here is that a row can be handled entirely in the lexer, where we have control over whitespace.
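
One consequence of this approach is that each row arrives as a single ROW token, so the value and its optional modifier still have to be separated from the token text afterwards. A minimal Python sketch of that step, assuming the token text has the shape described by the ROW rule above (the regex and the split_row helper are hypothetical, not part of ANTLR):

```python
import re

# Assumed shape of a ROW token's text: value, optional modifier, line break.
ROW_TEXT = re.compile(
    r'(?P<value>\$[A-Za-z_]+|"[^"]*"|[A-Za-z_]+)'  # ENVIRONMENT_VAR | DQUOTE_STR | VAR
    r'(?:[ \t]*(?P<modifier>"[^"]*"))?'            # (SPACE* DQUOTE_STR)?
    r'[\r\n]'                                      # LINEBREAK
)


def split_row(token_text):
    """Split the text of one ROW token into (value, modifier or None)."""
    match = ROW_TEXT.fullmatch(token_text)
    if match is None:
        raise ValueError(f"not a ROW token text: {token_text!r}")
    return match.group("value"), match.group("modifier")


print(split_row('SOME_VAR "optional modifier string"\n'))
print(split_row('$anEnvironmentVariable\n'))
```

In a real listener or visitor the input would be the text of the ROW token; the strings here are just the rows from the question.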

The parse tree is:

[image: parse tree]

Upvotes: 2

Bart Kiers

Reputation: 170158

You could use lexical modes here. This requires separate lexer and parser grammars, since lexical modes are only allowed in lexer grammars.

Whenever the lexer encounters an ENVIRONMENT_VAR, VAR or DQUOTE_STR (the first value in a row), it switches to a different lexical mode. In this new mode you match 3 things: strings, spaces (which you skip) and newlines (which you also skip, switching back to the default mode afterwards). This all might sound a bit vague, so here's a short demo of it all:

File: MarkupLexer.g4

lexer grammar MarkupLexer;

ENVIRONMENT_VAR       : '$' VAR    -> mode(MODIFIER_MODE);
VAR                   : [a-zA-Z_]+ -> mode(MODIFIER_MODE);
DQUOTE_STR            : STR        -> mode(MODIFIER_MODE);
OPEN_ANGLED_BRACKET   : '<';
CLOSED_ANGLED_BRACKET : '>';
SPACES                : [ \t\r\n] -> skip;

fragment STR : '"' ~["\r\n]* '"';

mode MODIFIER_MODE;

  MODIFIER_MODE_SPACES : [ \t] -> skip;
  MODIFIER_MODE_NL     : [\r\n]+ -> skip, mode(DEFAULT_MODE);
  MODIFIER_MODE_STRING : STR;

The parser will look like this:

File: MarkupParser.g4

parser grammar MarkupParser;

options {
  tokenVocab=MarkupLexer;
}

field
  : OPEN_ANGLED_BRACKET row_id row+ CLOSED_ANGLED_BRACKET
  ;

row
 : (ENVIRONMENT_VAR | DQUOTE_STR | VAR) MODIFIER_MODE_STRING?
 ;

row_id
 : VAR
 ;

When you parse the input:

< ID
SOME_VAR "optional modifier string"
$anEnvironmentVariable
"a constant string"
"another constant" "with its optional modifier"
>

you will get the following parse tree:

[image: parse tree]
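
To see how the mode switching decides between a constant and a modifier, the lexer above can be approximated in plain Python. This is only a sketch of the idea, not how ANTLR is implemented: the MODES table and the tokenize function are made up for illustration, with the rules transcribed by hand from MarkupLexer.g4 and the usual "longest match wins, first rule wins ties" policy assumed.

```python
import re

STR = r'"[^"\r\n]*"'

# Per mode: (token name, pattern, mode to switch to).
# A name of None means the match is skipped; a mode of None means "stay".
MODES = {
    "DEFAULT_MODE": [
        ("ENVIRONMENT_VAR", r"\$[A-Za-z_]+", "MODIFIER_MODE"),
        ("VAR", r"[A-Za-z_]+", "MODIFIER_MODE"),
        ("DQUOTE_STR", STR, "MODIFIER_MODE"),
        ("OPEN_ANGLED_BRACKET", r"<", None),
        ("CLOSED_ANGLED_BRACKET", r">", None),
        (None, r"[ \t\r\n]", None),
    ],
    "MODIFIER_MODE": [
        (None, r"[ \t]", None),
        (None, r"[\r\n]+", "DEFAULT_MODE"),
        ("MODIFIER_MODE_STRING", STR, None),
    ],
}


def tokenize(text):
    """Return (token name, text) pairs, switching modes like the grammar."""
    mode, pos, tokens = "DEFAULT_MODE", 0, []
    while pos < len(text):
        best = None
        for name, pattern, next_mode in MODES[mode]:
            match = re.compile(pattern).match(text, pos)
            # Longest match wins; on a tie the earlier rule is kept.
            if match and (best is None or match.end() > best[0].end()):
                best = (match, name, next_mode)
        if best is None:
            raise ValueError(f"no rule matches at offset {pos} in {mode}")
        match, name, next_mode = best
        if name is not None:
            tokens.append((name, match.group()))
        pos = match.end()
        mode = next_mode or mode
    return tokens


SAMPLE = (
    '< ID\n'
    'SOME_VAR "optional modifier string"\n'
    '$anEnvironmentVariable\n'
    '"a constant string"\n'
    '"another constant" "with its optional modifier"\n'
    '>\n'
)
tokens = tokenize(SAMPLE)
for name, text in tokens:
    print(f"{name:22} {text}")
```

A string at the start of a line is seen in the default mode and comes out as DQUOTE_STR, while a string later on the same line is seen in MODIFIER_MODE and comes out as MODIFIER_MODE_STRING, which is what lets the row rule tell them apart.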

Upvotes: 2
