Reputation: 11744
I have a portion of an ANTLR4 rule that I'd like to parse backwards. I suspect that's not the real solution, so there's likely something I'm missing.
The crux of my problem is that there's a part in the middle of my expression that I'd like to extract. However, this part has some (defined) suffixes that I would really like to extract separately, if possible. These suffixes can be separated by a comma or not; the grammar works fine with the comma, but if the comma is missing, it takes the entire part as `unknown`, even if the suffixes are present.
I've pared down my grammar into a small example, visible at the bottom of this post.
Given the string `why hello, x y z foo bar baz blah blah blah, goodbye!`, my grammar will parse `x y z foo bar baz` as a `phrase`. I would like to match `x y z` as `unknown` and `foo bar baz` as suffixes. If there is a comma (`x y z, foo bar baz`), it works.
However, if there is no comma, it takes the entire `x y z foo bar baz` (as well as some of the text after) as `unknown`.
I tried changing `unknown` to be nongreedy (`+?`), but that is undesirable as well, consuming only one token for `phrase`.
Is there a way to force the `phrase` rule to try matching suffixes from the right first before falling back to `unknown`?
Another way to put it: is there a way to have `unknown` match anything except when it ends with one or more suffixes? (The suffixes can appear in the text as long as they're not at the end.)
Example grammar:
grammar Example;
// parse tree root
exampleExpression : ignored HELLO separator phrase separator? unknown separator? GOODBYE ignored;
// what I want to match
phrase : unknown (COMMA? suffix+)*;
// convenience rule for swaths of tokens to be ignored (e.g. at the beginning and end)
ignored : (unknown | separator)*;
// roll up unknown tokens under one rule
unknown : (~(PERIOD | COMMA | PIPE | BULLET | SP_SEP_DASH))+;
separator : PERIOD | COMMA | PIPE | BULLET | SP_SEP_DASH;
// the pre-defined suffixes
suffix : FOO | BAR | BAZ;
/* TOKENS */
HELLO : 'hello';
GOODBYE : 'goodbye';
FOO : 'foo';
BAR : 'bar';
BAZ : 'baz';
/* FRAGMENTS */
fragment DIGIT : [0-9];
fragment DASH : '-';
/* REMAINING TOKENS */
LPAREN : '(' ;
RPAREN : ')' ;
COMMA : ',';
PERIOD : '.';
PIPE : '|';
BULLET : '\u00B7' | '\u2219' | '\u22c5';
SP_SEP_DASH : SP DASH SP;
SP : [ \u000B\t\r\n] -> channel(HIDDEN);
NUMBER : ([0] | [1-9] DIGIT*) ('.' DIGIT+)?;
WORD : [A-Za-z] [A-Za-z-]*;
// catch-all
OTHER : .;
Upvotes: 2
Views: 548
Reputation: 241671
The question says:
Another way to put it: is there a way to have unknown match anything except when it ends with one or more suffixes? (The suffixes can appear in the text as long as they're not at the end)
But previously, a parse of `unknown` with internal suffixes was rejected:
However, if there is no comma, it takes the entire x y z foo bar baz (as well as some of the text after) as unknown
That seems inconsistent.
From the example, it seems like you are trying to do natural language parsing; ANTLR, whatever its virtues, is probably not a good tool for that. But that impression might just be an artifact of your simplification.
In any event, the answer to your original question, "is it possible to define a non-terminal as any sequence of tokens which doesn't end with one or more tokens from a suffix class?", is "yes, that can be written as a context-free grammar". Without getting into ANTLR specifics, here's a simple CFG:
wordlist: /* empty */ | wordlist non_suffix | wordlist suffix_list non_suffix ;
suffix_list: suffix | suffix_list suffix ;
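To make the language described by this CFG concrete, here is a minimal Python sketch. It is not ANTLR and not a drop-in fix for the grammar in the question; it works on an already-tokenized list of words. The names `SUFFIXES`, `is_wordlist` and `split_trailing_suffixes` are hypothetical, chosen only for illustration, and the suffix set is taken from the question's `suffix` rule.

```python
# Suffix tokens, assumed from the question's `suffix : FOO | BAR | BAZ;` rule.
SUFFIXES = {"foo", "bar", "baz"}


def is_wordlist(tokens):
    """Return True if tokens match `wordlist` from the CFG above.

    A wordlist is empty, or every run of suffixes in it is eventually
    followed by a non-suffix token, so it never ends with a suffix.
    """
    pending_suffixes = 0
    for tok in tokens:
        if tok in SUFFIXES:
            pending_suffixes += 1  # inside a suffix_list, waiting for a non_suffix
        else:
            pending_suffixes = 0   # a non_suffix closes any open suffix_list
    return pending_suffixes == 0


def split_trailing_suffixes(tokens):
    """Split tokens into (wordlist, trailing suffixes), scanning from the right."""
    end = len(tokens)
    while end > 0 and tokens[end - 1] in SUFFIXES:
        end -= 1
    return tokens[:end], tokens[end:]


if __name__ == "__main__":
    words, suffixes = split_trailing_suffixes("x y z foo bar baz".split())
    print(words, suffixes)
```

Under these assumptions, `x foo y` is a valid `wordlist` (the suffix is internal), while `x y z foo bar baz` is not; the second function peels the trailing suffixes off the right end, which is the "parse backwards" effect the question asks about, done as a post-processing step rather than inside the parser.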
Upvotes: 1