strait

Reputation: 13

How to achieve lex/flex-like start states with Antlr4 (or what are the proper semantics with Antlr4)

Distilled down to a very simple example, I have an input file with "name equals value" pairs. The name is restricted to certain characters; the value can contain anything up to the newline.

So the regular expression that matches a line would be something like this: [a-zA-Z0-9_]+=[^\r\n]+

Here's the Antlr4 grammar, which is not correct:

grammar example;

example_file
    : code* EOF
    ;

code
    : NAME '=' VALUE '\r'? '\n'
    | NAME '=' NAME '\r'? '\n'
    ;

NAME
    : [a-zA-Z0-9_]+
    ;

VALUE
    : ~[\r\n]+
    ;

Example input:

name1=value1
name2=[value2 with extra~ chars]

The online test ground (http://lab.antlr.org/) reports: 1:0 mismatched input 'name1=value1' expecting {<EOF>, NAME}

I believe the problem is that VALUE matches the entire string and is returned as one token by the lexer.
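That suspicion is consistent with the usual lexer rule: the longest match wins, and ties go to the rule defined first. Below is a minimal Python simulation of that rule, not ANTLR itself. The `RULES` table, the `T_EQ`/`T_CR`/`T_LF` names for the implicit literal tokens, and the `tokenize` helper are all hypothetical stand-ins for the grammar above.

```python
import re

# Hypothetical stand-ins for the lexer rules of the combined grammar,
# in definition order (implicit literals first, then NAME, then VALUE).
RULES = [
    ("T_EQ", re.compile(r"=")),
    ("T_CR", re.compile(r"\r")),
    ("T_LF", re.compile(r"\n")),
    ("NAME", re.compile(r"[a-zA-Z0-9_]+")),
    ("VALUE", re.compile(r"[^\r\n]+")),
]

def tokenize(text):
    """Longest match wins; on a tie the earlier rule wins."""
    pos, tokens = 0, []
    while pos < len(text):
        best = None
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # Strict '>' keeps the earlier rule when lengths are equal.
            if m and (best is None or len(m.group()) > len(best[1])):
                best = (name, m.group())
        if best is None:
            raise ValueError(f"no rule matches at offset {pos}")
        tokens.append(best)
        pos += len(best[1])
    return tokens

print(tokenize("name1=value1\n"))
```

In this sketch NAME can match only "name1" at offset 0, while VALUE can match the whole line, so the parser never sees a NAME token at the start of the line.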

In (f)lex I would probably handle this by having a start state (e.g. %x VALUE), and so the lexer would keep the VALUE token exclusive to after the name and '=' have been recognized.

I've done quite a bit of Googling, but it's not clear to me how to handle this with Antlr4. (Note again that this is a deliberately distilled example to focus on the main issue, which should be trivial. I would have written the code by hand if this were all that was needed ;-))

I've re-written the grammar several times, but it is becoming clear that I'm lacking some knowledge about Antlr. I did purchase the book.

Note that this question is similar, but its comments do not answer my question: "I've Problems with ANTLR4 to parse key-value-pairs"

Upvotes: 0

Views: 81

Answers (1)

cschneid

Reputation: 10775

One way to get something close to what you want is...

grammar example;

example_file
    : code* EOF
    ;

code
    : NAME EQ (NAME | VALUE)+
    ;

NEWLINE
    : [\r\n]+
    ->channel(HIDDEN)
    ;

EQ
    : '='
    ;

NAME
    : [a-zA-Z0-9_]+
    ;

VALUE
    : ~[\r\n]+?
    ;

...but that makes the code rule a bit messy. One problem is that the VALUE rule can also match the equals sign, so it needs the non-greedy '?' modifier; with a greedy VALUE the whole line would be swallowed as a single token and the EQ rule would never get a chance to match.
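To see why the code rule needs (NAME | VALUE)+, here is a rough Python simulation of the token stream this grammar would produce. It is only a sketch: it assumes the non-greedy VALUE rule ends up matching one character at a time, and the `RULES` table, `HIDDEN` set and `visible_tokens` helper are hypothetical names, not part of ANTLR.

```python
import re

# Hypothetical stand-ins for the lexer rules above, in definition order.
# Assumption: the non-greedy VALUE rule matches a single character.
RULES = [
    ("NEWLINE", re.compile(r"[\r\n]+")),
    ("EQ", re.compile(r"=")),
    ("NAME", re.compile(r"[a-zA-Z0-9_]+")),
    ("VALUE", re.compile(r"[^\r\n]")),
]
HIDDEN = {"NEWLINE"}  # sent to the hidden channel, so the parser skips it

def visible_tokens(text):
    """Longest match wins, ties go to the earlier rule; hidden tokens are dropped."""
    pos, tokens = 0, []
    while pos < len(text):
        best = None
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            if m and (best is None or len(m.group()) > len(best[1])):
                best = (name, m.group())
        if best is None:
            raise ValueError(f"no rule matches at offset {pos}")
        if best[0] not in HIDDEN:
            tokens.append(best)
        pos += len(best[1])
    return tokens

for token in visible_tokens("name2=[value2 with extra~ chars]\n"):
    print(token)
```

Under these assumptions the value is split into a mix of NAME tokens (runs of name characters) and one-character VALUE tokens, which is why the parser rule has to accept a sequence of both.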

Another option would be to split your grammar into a lexer grammar and a parser grammar (two separate files, I called them Example1Lexer.g4 and Example1Parser.g4) and use a lexer mode. Modes are only allowed in lexer grammars, hence the split...

lexer grammar Example1Lexer;

NEWLINE
    : [\r\n]+
    ->channel(HIDDEN)
    ;

EQ
    : '='
    ->pushMode(VALUE_MODE)
    ;

NAME
    : [a-zA-Z0-9_]+
    ;

mode VALUE_MODE;

VALUE_MODE_NEWLINE
    : NEWLINE
    ->channel(HIDDEN),popMode
    ;

VALUE
    : ~[\r\n]+
    ;

parser grammar Example1Parser;

options {tokenVocab=Example1Lexer;}

example_file
    : code* EOF
    ;

code
    : NAME EQ VALUE
    ;

...which might be a bit cleaner, or not, depending on your personal preferences.
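For anyone coming from (f)lex, the mode behaves much like an exclusive start state. The following Python sketch mimics the mode stack of the lexer grammar above; it is a simulation under my own assumptions, not generated ANTLR code, and the `MODES` table, the `"push"`/`"pop"` action strings and the `tokenize` helper are hypothetical.

```python
import re

# Hypothetical rule tables, one per mode: (token name, pattern, action).
MODES = {
    "DEFAULT": [
        ("NEWLINE", re.compile(r"[\r\n]+"), None),
        ("EQ", re.compile(r"="), "push"),       # like ->pushMode(VALUE_MODE)
        ("NAME", re.compile(r"[a-zA-Z0-9_]+"), None),
    ],
    "VALUE_MODE": [
        ("VALUE_MODE_NEWLINE", re.compile(r"[\r\n]+"), "pop"),  # like popMode
        ("VALUE", re.compile(r"[^\r\n]+"), None),
    ],
}
HIDDEN = {"NEWLINE", "VALUE_MODE_NEWLINE"}  # hidden-channel tokens

def tokenize(text):
    """Only the rules of the mode on top of the stack are tried at each position."""
    stack, pos, tokens = ["DEFAULT"], 0, []
    while pos < len(text):
        best = None
        for name, pattern, action in MODES[stack[-1]]:
            m = pattern.match(text, pos)
            if m and (best is None or m.end() > best[1].end()):
                best = (name, m, action)
        if best is None:
            raise ValueError(f"no rule matches at offset {pos} in mode {stack[-1]}")
        name, m, action = best
        pos = m.end()
        if action == "push":
            stack.append("VALUE_MODE")
        elif action == "pop":
            stack.pop()
        if name not in HIDDEN:
            tokens.append((name, m.group()))
    return tokens

print(tokenize("name1=value1\nname2=[value2 with extra~ chars]\n"))
```

Because VALUE is only tried after an EQ has been seen, it can no longer swallow the name, and in this sketch a value that itself contains '=' (for example "a=b=c") still comes out as a single VALUE token.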

Upvotes: 0
