Reputation: 5742
From the book "The Definitive ANTLR 4 Reference":
Our STRING rule isn’t quite good enough yet because it doesn’t allow double quotes inside strings. To support that, most languages define escape sequences starting with a backslash. To get a double quote inside a double-quoted string, we use
\"
. To support the common escape characters, we need something like the following:
STRING : '"' ( ESC |.)*? '"' ;
fragment
ESC : '\\"' | '\\\\' ; // 2-char sequences \" and \\
ANTLR itself needs to escape the escape character, so that’s why we need
\\
to specify the backslash character. The loop in STRING now matches either an escape character sequence, by calling fragment ruleESC
, or any single character via the dot wildcard. The*?
subrule operator terminates the(ESC |.)*?
That sounds fine, but when I read that I noticed a certain ambiguity in the choice between ESC
and .
. As far as STRING
is concerned, it is possible to match an input "Hi\""
by matching the escape character \
to the .
, and to consider the following escaped double-quote as closing the string. This would even be less greedy and so would conform better to the use of ?
.
The problem, of course, is that if we do that, then we have an extra double-quote at the end that does not get matched to anything.
So I wrote the following grammar:
grammar String;
anything: STRING '"'? '\r\n';
STRING: '"' (ESC|.)*? '"';
fragment
ESC: '\\"' | '\\\\';
which accepts an optional lonely double-quote character right after the string. This grammar still parses "Orange\""
as a full string:
So my question is: why is this the accepted parse, as opposed to the one taking "Orange\"
as the STRING
, followed by an isolated double-quote "
? Note that the latter would be less greedy, which would seem to conform better to the use of ?
, so one could think it would be preferable.
Upvotes: 1
Views: 335
Reputation: 5742
After some more experimentation, I realize the explanation is that the choice operator |
is order-dependent (but only under non-greedy operator ?
): ESC
is tried before .
. If I invert the two and write (.|ESC)*?
, I do get
This is not really surprising, but an interesting reminder that ANTLR is not as declarative as we may sometimes expect (in the sense that logic-or is order-independent but |
is not). It is also a good reminder that the non-greedy operator ?
does not extend its minimization capabilities to all choices, but just to the first one that matches the input (@sepp2k adds that order dependency only applies to the non-greedy case).
Upvotes: 1