Match sub-rule backwards in ANTLR4 parser

Question

I have a portion of an ANTLR4 rule that I'd like to parse backwards. I suspect that's not the real solution, so there's likely something I'm missing.

The crux of my problem is that there's a part in the middle of my expression that I'd like to extract. However, this part has some (defined) suffixes that I would really like to extract separately, if possible. These suffixes can be separated by a comma or not; the grammar works fine with the comma, but if the comma is missing, it takes the entire part as unknown, even if the suffixes are present.

I've pared down my grammar into a small example, visible at the bottom of this post.

Given the string "why hello, x y z foo bar baz blah blah blah, goodbye!", my grammar will parse "x y z foo bar baz" as a phrase. I would like to match "x y z" as unknown and "foo bar baz" as suffixes. If there is a comma ("x y z, foo bar baz"), it works: [image: tree generated with comma]

However, if there is no comma, it takes the entire "x y z foo bar baz" (as well as some of the text after) as unknown: [image: tree generated with no comma]

I tried making unknown nongreedy (+?), but that is undesirable as well, because phrase then consumes only one token: [image: tree generated with no comma and nongreedy unknown]

Is there a way to force the phrase rule to try matching suffixes from the right first before falling back to unknown?

Another way to put it: is there a way to have unknown match anything except when it ends with one or more suffixes? (The suffixes can appear in the text as long as they're not at the end)
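To make the intended split concrete, here is a minimal sketch in plain Python rather than ANTLR. It assumes the phrase has already been isolated and tokenized into a list of words; the name split_trailing_suffixes and the SUFFIXES set are hypothetical and only mirror the suffix rule in the grammar below. It is not a parser-level solution, just an illustration of the result being asked for.

```python
# Hypothetical illustration of the desired split; not part of the grammar.
SUFFIXES = {"foo", "bar", "baz"}  # mirrors the suffix rule: FOO | BAR | BAZ


def split_trailing_suffixes(tokens):
    """Split tokens into (unknown, suffixes), peeling suffixes off the right end."""
    cut = len(tokens)
    # Walk backwards while the last remaining token is a suffix, but always
    # leave at least one token for unknown (the unknown rule uses +).
    while cut > 1 and tokens[cut - 1] in SUFFIXES:
        cut -= 1
    return tokens[:cut], tokens[cut:]


if __name__ == "__main__":
    print(split_trailing_suffixes("x y z foo bar baz".split()))
    print(split_trailing_suffixes("foo x bar".split()))
```

A suffix word that is not at the end (the leading "foo" in the second call) stays in unknown; only the trailing run is split off.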

Example grammar:

grammar Example;

// parse tree root
exampleExpression : ignored HELLO separator phrase separator? unknown separator? GOODBYE ignored;

// what I want to match
phrase : unknown (COMMA? suffix+)*;

// convenience rule for swaths of tokens to be ignored (e.g. at the beginning and end)
ignored : (unknown | separator)*;

// roll up unknown tokens under one rule
unknown : (~(PERIOD | COMMA | PIPE | BULLET | SP_SEP_DASH))+;
separator : PERIOD | COMMA | PIPE | BULLET | SP_SEP_DASH;

// the pre-defined suffixes
suffix : FOO | BAR | BAZ;

/* TOKENS */

HELLO : 'hello';
GOODBYE : 'goodbye';
FOO : 'foo';
BAR : 'bar';
BAZ : 'baz';

/* FRAGMENTS */

fragment DIGIT : [0-9];
fragment DASH : '-';

/* REMAINING TOKENS */

LPAREN : '(' ;
RPAREN : ')' ;
COMMA : ',';
PERIOD : '.';
PIPE : '|';
BULLET : '\u00B7' | '\u2219' | '\u22c5';
SP_SEP_DASH : SP DASH SP;

SP : [ \u000B\t\r\n] -> channel(HIDDEN);

NUMBER : ([0] | [1-9] DIGIT*) ('.' DIGIT+)?;
WORD : [A-Za-z] [A-Za-z-]*;

// catch-all
OTHER : .;
