Reputation: 305
I'm currently developing a parser for an old proprietary markup-like language which has to be converted to a newer standard. I'm using ANTLR 4 for that.
The structure is composed of blocks delimited by a specific starter and its corresponding terminator (e.g. { ... }, < ... >, INPUT ... END). Inside each block, elements are specified in rows separated by newlines; however, these newlines matter only in some places for understanding what the code means.
For example:
< ID
SOME_VAR "optional modifier string"
$anEnvironmentVariable
"a constant string"
"another constant" "with its optional modifier"
>
A parser rule like the following
field
: OPEN_ANGLED_BRACKET row_id
((ENVIRONMENT_VAR | DQUOTE_STR | VAR) DQUOTE_STR?)+
CLOSED_ANGLED_BRACKET
;
// [...]
WHITESPACE
: [ \t\r\n] -> skip
;
can easily parse the above example, but because newlines are skipped, it can't tell whether a double-quoted string is a constant (it starts a line) or a modifier string (it follows a variable or constant on the same line).
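To make the ambiguity concrete, here is a small Python sketch (plain regexes, not ANTLR; the `tokenize` helper and its patterns are my own stand-ins for the lexer rules, and the string pattern is an assumption since `DQUOTE_STR` isn't shown above). It drops all whitespace the way the `WHITESPACE` rule does. Two inputs with different meanings end up as the same token stream, so no parser rule can tell them apart afterwards.

```python
import re

# Hypothetical stand-ins for the lexer rules in the question.
# Whitespace inside the character class is kept even in VERBOSE mode.
TOKEN_RE = re.compile(r'''
    (?P<DQUOTE_STR>"[^"\r\n]*")
  | (?P<ENVIRONMENT_VAR>\$[a-zA-Z_]+)
  | (?P<VAR>[a-zA-Z_]+)
  | (?P<WHITESPACE>[ \t\r\n]+)
''', re.VERBOSE)


def tokenize(text):
    """Return the token type names, dropping whitespace (newlines included)."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at offset {pos}")
        if m.lastgroup != "WHITESPACE":
            tokens.append(m.lastgroup)
        pos = m.end()
    return tokens


# A constant with a modifier on the same line...
same_line = '"another constant" "its modifier"\n'
# ...versus two separate constants on two lines.
two_lines = '"another constant"\n"a second constant"\n'

print(tokenize(same_line))
print(tokenize(two_lines))
```

Both calls print the same list of two `DQUOTE_STR` entries, which is exactly the information loss described above.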
I could actually explicitly handle the newline like this:
field
: OPEN_ANGLED_BRACKET row_id NEWLINE
((ENVIRONMENT_VAR | DQUOTE_STR | VAR) DQUOTE_STR? NEWLINE)+
CLOSED_ANGLED_BRACKET NEWLINE
;
// [...]
WHITESPACE
: [ \t] -> skip
;
NEWLINE
: '\r'? '\n'
| '\r'
;
but then I must explicitly handle newline everywhere in the rest of the grammar, complicating it by a lot!
Is there any way to keep explicit newlines confined inside angled brackets, skipping them everywhere else "automatically"?
Upvotes: 4
Views: 1560
Reputation: 53337
I wanted to come up with a solution that doesn't use lexer modes (as I find them ugly) and therefore modified Bart's grammar:
grammar SOGrammar;
start:
OPEN_ANGLED_BRACKET ROW_ID ROW* CLOSED_ANGLED_BRACKET
;
VAR: [a-zA-Z_]+;
ENVIRONMENT_VAR: '$' VAR;
DQUOTE_STR: '"' .*? '"';
OPEN_ANGLED_BRACKET: '<';
CLOSED_ANGLED_BRACKET: '>';
ROW_ID: VAR LINEBREAK;
ROW:
(ENVIRONMENT_VAR | DQUOTE_STR | VAR) (SPACE* DQUOTE_STR)? LINEBREAK
;
SPACE: [ \t] -> skip;
LINEBREAK: [\r\n] -> skip;
The idea here is that a row can be handled entirely in the lexer, where we have control over whitespace.
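One consequence worth keeping in mind: since `ROW` is a lexer rule, each row reaches the parser as a single token, so the leading element and the optional modifier have to be separated afterwards from the token text. A rough Python sketch of that post-processing step follows. `ROW_RE` and `split_row` are hypothetical names of mine, and the string pattern is stricter than `DQUOTE_STR` above because it does not allow a string to span lines.

```python
import re

# Hypothetical regex mirroring the ROW rule: a head (environment variable,
# quoted string, or plain variable), an optional modifier string on the
# same line, then a line break.
ROW_RE = re.compile(
    r'(?P<head>\$[a-zA-Z_]+|"[^"\r\n]*"|[a-zA-Z_]+)'
    r'(?:[ \t]*(?P<modifier>"[^"\r\n]*"))?'
    r'(?:\r\n|\r|\n)'
)


def split_row(row_text):
    """Split the text of one ROW token into (head, modifier or None)."""
    m = ROW_RE.fullmatch(row_text)
    if m is None:
        raise ValueError(f"not a valid row: {row_text!r}")
    return m.group("head"), m.group("modifier")


print(split_row('SOME_VAR "optional modifier string"\n'))
print(split_row('"a constant string"\n'))
```

In a real listener or visitor you would apply something like this to the text of each `ROW` token.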
The parse tree is:
Upvotes: 2
Reputation: 170158
You could use lexical modes here. You'd have to define separate lexer and parser grammars to use lexical modes.
Whenever you encounter an ENVIRONMENT_VAR, VAR or DQUOTE_STR in the lexer (the first value in a row), you change the lexical mode. In this new lexical mode you match 3 things: strings, spaces (which you skip) and newlines (which you also skip, and after which you change back to the default mode). This all might sound a bit vague, so here's a short demo of it all:
lexer grammar MarkupLexer;
ENVIRONMENT_VAR : '$' VAR -> mode(MODIFIER_MODE);
VAR : [a-zA-Z_]+ -> mode(MODIFIER_MODE);
DQUOTE_STR : STR -> mode(MODIFIER_MODE);
OPEN_ANGLED_BRACKET : '<';
CLOSED_ANGLED_BRACKET : '>';
SPACES : [ \t\r\n] -> skip;
fragment STR : '"' ~["\r\n]* '"';
mode MODIFIER_MODE;
MODIFIER_MODE_SPACES : [ \t] -> skip;
MODIFIER_MODE_NL : [\r\n]+ -> skip, mode(DEFAULT_MODE);
MODIFIER_MODE_STRING : STR;
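If you want to see the mode switching in isolation, without generating anything with ANTLR, the following Python sketch imitates the lexer grammar above with plain regexes. It is only an approximation: the `RULES` table and the `lex` function are my own illustration, and it picks the first matching rule rather than ANTLR's longest match (the two agree here because no two rules in a mode can start on the same character).

```python
import re

STR = r'"[^"\r\n]*"'

# Per mode: (token name, regex, skipped?, mode to switch to or None).
RULES = {
    "DEFAULT_MODE": [
        ("ENVIRONMENT_VAR", r"\$[a-zA-Z_]+", False, "MODIFIER_MODE"),
        ("VAR", r"[a-zA-Z_]+", False, "MODIFIER_MODE"),
        ("DQUOTE_STR", STR, False, "MODIFIER_MODE"),
        ("OPEN_ANGLED_BRACKET", r"<", False, None),
        ("CLOSED_ANGLED_BRACKET", r">", False, None),
        ("SPACES", r"[ \t\r\n]", True, None),
    ],
    "MODIFIER_MODE": [
        ("MODIFIER_MODE_SPACES", r"[ \t]", True, None),
        ("MODIFIER_MODE_NL", r"[\r\n]+", True, "DEFAULT_MODE"),
        ("MODIFIER_MODE_STRING", STR, False, None),
    ],
}


def lex(text):
    """Return (token name, text) pairs, switching modes like the grammar."""
    mode, pos, out = "DEFAULT_MODE", 0, []
    while pos < len(text):
        for name, pattern, skip, next_mode in RULES[mode]:
            m = re.compile(pattern).match(text, pos)
            if m:
                break
        else:
            raise ValueError(f"no rule matches at offset {pos} in {mode}")
        if not skip:
            out.append((name, m.group()))
        if next_mode:
            mode = next_mode
        pos = m.end()
    return out


for token in lex('"a constant string"\n"another constant" "its modifier"\n'):
    print(token)
```

The string that starts a line comes out as `DQUOTE_STR`, while the second string on the same line comes out as `MODIFIER_MODE_STRING`, which is the distinction the parser needs.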
The parser will look like this:
parser grammar MarkupParser;
options {
tokenVocab=MarkupLexer;
}
field
: OPEN_ANGLED_BRACKET row_id row+ CLOSED_ANGLED_BRACKET
;
row
: (ENVIRONMENT_VAR | DQUOTE_STR | VAR) MODIFIER_MODE_STRING?
;
row_id
: VAR
;
When you parse the input:
< ID
SOME_VAR "optional modifier string"
$anEnvironmentVariable
"a constant string"
"another constant" "with its optional modifier"
>
you will get the following parse tree:
Upvotes: 2