Lexing token ambiguity in ANTLR4

Question

I have an interesting problem parsing the following grammar (Conventional Commits), a convention for how git commit messages should be formatted.

<type>[optional scope]: <description>

[optional body]

[optional footer(s)]

The body is simply multi-line text where anything goes.
The footer consists of key-value pairs in the format foobar: this is value, separated by newlines.

Now, regarding my dilemma: what would be the best way to differentiate the body part from the footer part? According to the spec, those should be separated by two newline characters, so at first I thought this would be a good fit for ANTLR4 island grammars. I came up with something like what I posted here, but after some testing I discovered it is not flexible: it won't work if the body is absent (the body section is optional) but the footer is present.

I can think of a couple of ways to restrict the grammar to a certain language and implement this differentiation with semantic predicates but ideally, I would like to avoid that.

Now, I think the problem boils down to how to properly differentiate between the KEY and SINGLE_LINE tokens, which conflict (in the next iteration of my implementation):

mode Text;
KEY: [a-z][a-z_-]+;
SINGLE_LINE: ~[\r\n]+;

MULTI_LINE: SINGLE_LINE (NEWLINE SINGLE_LINE)*;

NEXT: NEWLINE NEWLINE;

What would be the best way to differentiate between KEY and SINGLE_LINE?
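To see why these two tokens conflict, recall that an ANTLR4 lexer picks the rule that matches the longest stretch of input and falls back to rule order only on a tie. The following is a minimal Python sketch that imitates that behaviour with the `re` module. It is not ANTLR itself; the regexes only approximate the two rules above, and `next_token` is a hypothetical helper written for this illustration.

```python
import re

# Approximations of the question's Text-mode rules, in grammar order.
RULES = [
    ("KEY", re.compile(r"[a-z][a-z_-]+")),
    ("SINGLE_LINE", re.compile(r"[^\r\n]+")),
]


def next_token(text, pos=0):
    """Return (rule name, matched text) for the longest match at pos.

    A later rule replaces an earlier one only when it matches strictly
    more input, so the earlier rule wins ties.
    """
    best = None
    for name, pattern in RULES:
        m = pattern.match(text, pos)
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best


# SINGLE_LINE consumes the whole line, so KEY never gets a chance here.
print(next_token("signed-off: john.doe@some.domain.com"))
# Only when both rules match the same length does KEY win by rule order.
print(next_token("signed-off"))
```

Under this approximation, any footer line such as `signed-off: ...` comes out as one SINGLE_LINE token, which is why reordering the rules alone does not help.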

Bart Kiers · Accepted Answer

I'd do something like this:

ConventionalCommitsLexer.g4

lexer grammar ConventionalCommitsLexer;

options {
  caseInsensitive=true;
}

TYPE : [a-z]+;
LPAR : '(' -> pushMode(Scope);
COL  : ':' -> pushMode(Text);

fragment SPACE : [ \t];

mode Scope;

 SCOPE : ~[)]+;
 RPAR  : ')' SPACE* -> popMode;

mode Text;

 COL2    : ':' -> type(COL);
 SPACES : SPACE+ -> skip;
 WORD   : ~[: \t\r\n]+;
 NL     : SPACE* '\r'? '\n' SPACE*;
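To make the effect of the lexer modes concrete, here is a rough Python sketch that imitates the lexer grammar above with a mode stack and longest-match rule selection. It is a hand-written approximation, not the lexer ANTLR generates: the `MODES` table and the `tokenize` function are hypothetical names, and the regexes are my reading of the rules (with `re.IGNORECASE` standing in for `caseInsensitive=true`).

```python
import re

# Each rule: (token type, regex, action). Actions mimic the lexer commands.
MODES = {
    "DEFAULT": [
        ("TYPE", r"[a-z]+", None),
        ("LPAR", r"\(", ("push", "Scope")),
        ("COL", r":", ("push", "Text")),
    ],
    "Scope": [
        ("SCOPE", r"[^)]+", None),
        ("RPAR", r"\)[ \t]*", ("pop", None)),
    ],
    "Text": [
        ("COL", r":", None),                 # COL2 -> type(COL)
        ("SPACES", r"[ \t]+", ("skip", None)),
        ("WORD", r"[^: \t\r\n]+", None),
        ("NL", r"[ \t]*\r?\n[ \t]*", None),
    ],
}


def tokenize(text):
    """Return a list of (token type, text) pairs for a commit message."""
    stack = ["DEFAULT"]
    pos = 0
    tokens = []
    while pos < len(text):
        best = None
        for name, pattern, action in MODES[stack[-1]]:
            m = re.compile(pattern, re.IGNORECASE).match(text, pos)
            # Longest match wins; earlier rules win ties.
            if m and (best is None or m.end() > best[1].end()):
                best = (name, m, action)
        if best is None:
            raise ValueError(f"no rule matches at offset {pos}")
        name, m, action = best
        pos = m.end()
        if action and action[0] == "push":
            stack.append(action[1])
        elif action and action[0] == "pop":
            stack.pop()
        if not (action and action[0] == "skip"):
            tokens.append((name, m.group()))
    return tokens


for token in tokenize("fix(api): add x\n\nKey: v"):
    print(token)
```

The point of the design is visible in the output: once the first `:` switches to Text mode, every line is split into WORD and COL tokens instead of one opaque line token, so telling a body from a footer becomes the parser's job.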

ConventionalCommitsParser.g4

parser grammar ConventionalCommitsParser;

options {
  tokenVocab=ConventionalCommitsLexer;
}

commit
 : TYPE scope? COL description ( NL NL body )? ( NL NL footer )? EOF
 ;

scope
 : LPAR SCOPE RPAR
 ;

description
 : word+
 ;

// A 'body' cannot start with `WORD COL`, hence: `WORD WORD`
body
 : WORD WORD word* ( NL word+ )*
 ;

footer
 : key_value ( NL key_value )* NL?
 ;

key_value
 : WORD COL word+
 ;

word
 : WORD
 | COL
 ;
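The comment on the `body` rule carries the key idea: after the blank line, a block that begins with `WORD COL` is treated as a footer, and one that begins with `WORD WORD` is treated as a body. The sketch below restates that decision in Python over a list of token type names. `classify_block` is a hypothetical helper for illustration only; in the real grammar the parser makes this choice through its own lookahead.

```python
def classify_block(token_types):
    """Classify the block that follows NL NL by its first two token types."""
    head = tuple(token_types[:2])
    if head == ("WORD", "COL"):
        return "footer"   # key_value : WORD COL word+
    if head == ("WORD", "WORD"):
        return "body"     # body : WORD WORD word* ...
    return None           # matches neither rule as written


# "Some more in-depth description ..." starts with two words.
print(classify_block(["WORD", "WORD", "WORD", "COL", "WORD"]))
# "Signed-off: john.doe@some.domain.com" starts with a word and a colon.
print(classify_block(["WORD", "COL", "WORD"]))
```

One consequence of the rules as written is that a body whose first line is a single word matches neither alternative, so such input would need a different rule.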

Parsing the input (body + footer):

fix(some_module): this is a commit description
    
Some more in-depth description of what was fixed: this
can be a multi-line text, not only a one-liner.

Signed-off: john.doe@some.domain.com
Another-Key: another value with : (colon)
Some-Other-Key: some other value

result:

Parsing the input (only body):

fix(some_module): this is a commit description
    
Some more in-depth description of what was fixed: this
can be a multi-line text, not only a one-liner.

result:

Parsing the input (only footer):

fix(some_module): this is a commit description

Signed-off: john.doe@some.domain.com
Another-Key: another value with : (colon)
Some-Other-Key: some other value

result:
