Reputation: 61
I am trying to update an ANTLR grammar to follow this spec:
https://github.com/facebook/graphql/pull/327/files
In logical terms it's defined as:
StringValue ::
- `"` StringCharacter* `"`
- `"""` MultiLineStringCharacter* `"""`
StringCharacter ::
- SourceCharacter but not `"` or \ or LineTerminator
- \u EscapedUnicode
- \ EscapedCharacter
MultiLineStringCharacter ::
- SourceCharacter but not `"""` or `\"""`
- `\"""`
(Note the above is logical - not ANTLR syntax).
I tried the following in ANTLR 4, but it won't recognize more than one character inside a triple-quoted string:
string : triplequotedstring | StringValue ;
triplequotedstring: '"""' triplequotedstringpart? '"""';
triplequotedstringpart : EscapedTripleQuote* | SourceCharacter*;
EscapedTripleQuote : '\\"""';
SourceCharacter :[\u0009\u000A\u000D\u0020-\uFFFF];
StringValue: '"' (~(["\\\n\r\u2028\u2029])|EscapedChar)* '"';
With these rules it will recognize '"""a"""', but as soon as I add more characters it fails:
e.g. '"""abc"""' won't parse, and the IntelliJ plugin for ANTLR says
line 1:14 extraneous input 'abc' expecting {'"""', '\\"""', SourceCharacter}
How do I do triple quoted strings in ANTLR with '"""' escaping?
Upvotes: 5
Views: 562
Reputation: 170278
Some of your parser rules should really be lexer rules. And SourceCharacter should probably be a fragment.
Also, instead of EscapedTripleQuote* | SourceCharacter*, you probably want ( EscapedTripleQuote | SourceCharacter )*. The first matches aaa... or bbb..., while you probably meant to match aababbba...
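The difference between the two forms can be checked with an ordinary regular expression. The sketch below is only an analogy using java.util.regex, not ANTLR: the letter a stands in for EscapedTripleQuote and b for SourceCharacter, and the class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

// Regex analogy only (not ANTLR): 'a' plays EscapedTripleQuote, 'b' plays SourceCharacter.
final class AlternationDemo {
    // Mirrors: EscapedTripleQuote* | SourceCharacter*
    static final Pattern STAR_OF_EACH = Pattern.compile("a*|b*");
    // Mirrors: ( EscapedTripleQuote | SourceCharacter )*
    static final Pattern STAR_OF_EITHER = Pattern.compile("(a|b)*");

    static boolean matchesStarOfEach(String s) {
        return STAR_OF_EACH.matcher(s).matches();
    }

    static boolean matchesStarOfEither(String s) {
        return STAR_OF_EITHER.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"aaa", "bbb", "aababbba"}) {
            System.out.println(s + ": a*|b* = " + matchesStarOfEach(s)
                    + ", (a|b)* = " + matchesStarOfEither(s));
        }
    }
}
```

Only the second pattern accepts the mixed input aababbba, which is the situation inside a triple-quoted string where escaped quotes and ordinary characters are interleaved.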
Try something like this instead:
string
: Triplequotedstring
| StringValue
;
Triplequotedstring
: '"""' TriplequotedstringPart*? '"""'
;
StringValue
: '"' ( ~["\\\n\r\u2028\u2029] | EscapedChar )* '"'
;
// Fragments never become a token of their own: they are only used inside other lexer rules
fragment TriplequotedstringPart : EscapedTripleQuote | SourceCharacter;
fragment EscapedTripleQuote : '\\"""';
fragment SourceCharacter :[\u0009\u000A\u000D\u0020-\uFFFF];
Upvotes: 2
Reputation: 31
Triple-quoted strings are often used to allow multi-line strings and unescaped characters inside a string. Assuming that you are skipping spaces and line breaks, parsing triple quotes can be quite tricky because of corner cases, such as one or two quote characters inside the string or directly before the closing delimiter.
To cope with these, a grammar with lexer modes can be used:
Lexer:
START_TRIPLE_QUOTE: '"""' -> pushMode(INSIDE_TRIPLE_QUOTE);
mode INSIDE_TRIPLE_QUOTE;
TRIPLE_QUOTED_STRING_CONTENT : '"' '"'? ~["] // Match one or two quotes followed by a non-quote
| ~["] // Match any character that is not a quote
;
TRIPLE_QUOTE_END_2: '"""""' -> popMode;
TRIPLE_QUOTE_END_1: '""""' -> popMode;
TRIPLE_QUOTE_END_0: '"""' -> popMode;
Parser:
triple_string_literal: START_TRIPLE_QUOTE (TRIPLE_QUOTED_STRING_CONTENT)*
(TRIPLE_QUOTE_END_2
| TRIPLE_QUOTE_END_1
| TRIPLE_QUOTE_END_0);
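To make the behaviour of these rules easier to follow, here is a hand-written Java sketch that imitates what the INSIDE_TRIPLE_QUOTE mode does. It is not generated by ANTLR and does not use the ANTLR runtime; the class name TripleQuoteScanSketch and the method content are hypothetical, and the sketch assumes the input starts at the opening delimiter and ignores anything after the closing one.

```java
// Hand-rolled imitation of the mode rules above; an illustration, not ANTLR output.
final class TripleQuoteScanSketch {
    // Returns the text between the opening """ and the last three quotes of the closing token.
    static String content(String input) {
        if (!input.startsWith("\"\"\"")) {
            throw new IllegalArgumentException("missing opening triple quote");
        }
        int pos = 3; // just past START_TRIPLE_QUOTE
        while (pos < input.length()) {
            // Count the run of consecutive quote characters at the current position.
            int run = 0;
            while (pos + run < input.length() && input.charAt(pos + run) == '"') {
                run++;
            }
            if (run >= 3) {
                // TRIPLE_QUOTE_END_0/1/2: the closing token is 3 to 5 quotes long;
                // any quotes beyond the final three are kept as string content.
                int closing = Math.min(run, 5);
                return input.substring(3, pos + closing - 3);
            }
            if (run > 0 && pos + run >= input.length()) {
                break; // one or two dangling quotes at end of input
            }
            // TRIPLE_QUOTED_STRING_CONTENT: zero to two quotes followed by one non-quote.
            pos += run + 1;
        }
        throw new IllegalArgumentException("unterminated triple-quoted string");
    }

    public static void main(String[] args) {
        System.out.println(content("\"\"\"abc\"\"\""));
        System.out.println(content("\"\"\"\"a\"\"\"\""));
    }
}
```

The three END rules exist so that one or two quotes directly before the closing delimiter end up in the string instead of breaking the match.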
And in your Listener/Visitor:
TripleQuotedStringConst(ctx.getText().substring(3, ctx.getText().length() - 3))
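The substring call can be tried on its own. The following is a minimal, self-contained sketch of that delimiter stripping; the helper class TripleQuoteText and its method stripDelimiters are hypothetical names, and the argument is assumed to be the full matched text (what ctx.getText() returns in the line above).

```java
// Standalone illustration of substring(3, length - 3); names are hypothetical.
final class TripleQuoteText {
    // Removes the leading and trailing """ from the full matched text.
    static String stripDelimiters(String raw) {
        if (raw.length() < 6 || !raw.startsWith("\"\"\"") || !raw.endsWith("\"\"\"")) {
            throw new IllegalArgumentException("not a triple-quoted literal: " + raw);
        }
        return raw.substring(3, raw.length() - 3);
    }

    public static void main(String[] args) {
        System.out.println(stripDelimiters("\"\"\"abc\"\"\""));
        // Four quotes on each side: the extra quote on each side stays in the content.
        System.out.println(stripDelimiters("\"\"\"\"a\"\"\"\""));
    }
}
```

Because only three characters are removed from each end, a closing token of four or five quotes leaves its extra quotes in the resulting string.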
As a reference here is an article that I wrote: https://medium.com/@alexzerntev/parsing-multi-line-triple-quoted-strings-with-antlr4-ceca41cdeadb
Upvotes: 0