user118967
user118967

Reputation: 5742

String lexical rule in ANTLR with greedy wildcald and escape character

From the book "The Definitive ANTLR 4 Reference":

Our STRING rule isn’t quite good enough yet because it doesn’t allow double quotes inside strings. To support that, most languages define escape sequences starting with a backslash. To get a double quote inside a double-quoted string, we use \". To support the common escape characters, we need something like the following:

STRING ​: ​ ​'"' ​( ESC |.)*?​ ​'"' ​ ​;
fragment
ESC ​: ​ ​'\\"'  | ​ ​'\\\\' ​ ​; ​ ​// 2-char sequences \" and \\ 

​ ANTLR itself needs to escape the escape character, so that’s why we need \\ to specify the backslash character. The loop in STRING now matches either an escape character sequence, by calling fragment rule ESC, or any single character via the dot wildcard. The *? subrule operator terminates the (ESC |.)*?

That sounds fine, but when I read that I noticed a certain ambiguity in the choice between ESC and .. As far as STRING is concerned, it is possible to match an input "Hi\"" by matching the escape character \ to the ., and to consider the following escaped double-quote as closing the string. This would even be less greedy and so would conform better to the use of ?.

The problem, of course, is that if we do that, then we have an extra double-quote at the end that does not get matched to anything.

So I wrote the following grammar:

grammar String;

anything: STRING '"'? '\r\n';

STRING: '"' (ESC|.)*? '"';
fragment
ESC: '\\"' | '\\\\';

which accepts an optional lonely double-quote character right after the string. This grammar still parses "Orange\"" as a full string:

Parse of <code>"Orange\""</code> even if a later isolated double-quote is acceptable

So my question is: why is this the accepted parse, as opposed to the one taking "Orange\" as the STRING, followed by an isolated double-quote "? Note that the latter would be less greedy, which would seem to conform better to the use of ?, so one could think it would be preferable.

Upvotes: 1

Views: 335

Answers (1)

user118967
user118967

Reputation: 5742

After some more experimentation, I realize the explanation is that the choice operator | is order-dependent (but only under non-greedy operator ?): ESC is tried before .. If I invert the two and write (.|ESC)*?, I do get

Parse of <code>"Orange\""</code> even if a later isolated double-quote is acceptable after switching <code>ESC</code> and <code>.</code> in the grammar

This is not really surprising, but an interesting reminder that ANTLR is not as declarative as we may sometimes expect (in the sense that logic-or is order-independent but | is not). It is also a good reminder that the non-greedy operator ? does not extend its minimization capabilities to all choices, but just to the first one that matches the input (@sepp2k adds that order dependency only applies to the non-greedy case).

Upvotes: 1

Related Questions