hendryau

Reputation: 456

Handling String Literals which End in an Escaped Quote in ANTLR4

How do I write a lexer rule to match a String literal which does not end in an escaped quote?

Here's my grammar:

lexer grammar StringLexer;

// from The Definitive ANTLR 4 Reference
STRING: '"' (ESC|.)*? '"';
fragment ESC : '\\"' | '\\\\' ;

Here's my Java code:

String s = "\"\\\""; // looks like "\"
StringLexer lexer = new StringLexer(new ANTLRInputStream(s)); 

Token t = lexer.nextToken();

if (t.getType() == StringLexer.STRING) {
    System.out.println("Saw a String");
}
else {
    System.out.println("Nope");
}

This outputs Saw a String. Should "\" really match STRING?

Edit: Both 280Z28's and Bart's solutions are great; unfortunately, I can only accept one.

Upvotes: 12

Views: 10817

Answers (2)

Sam Harwell

Reputation: 99859

For properly formed input, the lexer will match the text you expect. However, the non-greedy operator does not prevent the rule from matching anything of the following form, because . can consume a backslash on its own instead of leaving it to ESC:

'"' .*? '"'

To ensure strings are tokenized in the most "sane" way possible, I recommend using the following rules.

StringLiteral
  : UnterminatedStringLiteral '"'
  ;

UnterminatedStringLiteral
  : '"' (~["\\\r\n] | '\\' (. | EOF))*
  ;

If your language allows string literals to span across multiple lines, you would likely need to modify UnterminatedStringLiteral to allow matching end-of-line characters.

If you do not include the UnterminatedStringLiteral rule, the lexer will handle unterminated strings by simply ignoring the opening " character of the string and proceeding to tokenize the content of the string.
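
To make the behaviour of these two rules concrete, here is a rough hand-written Java analogue. It is only a sketch of the matching logic, not code generated by ANTLR; the class name StringLiteralSketch, the Kind enum and the classify method are all made up for illustration, and it assumes the string starts at index 0 of the input.

```java
class StringLiteralSketch {
    // Hypothetical token kinds mirroring the two grammar rules above.
    enum Kind { STRING_LITERAL, UNTERMINATED_STRING_LITERAL, NOT_A_STRING }

    // Scans from index 0 and reports which rule the prefix corresponds to.
    static Kind classify(String input) {
        if (input.isEmpty() || input.charAt(0) != '"') {
            return Kind.NOT_A_STRING;
        }
        int i = 1;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == '\\') {
                // '\\' (. | EOF): a backslash also consumes the next character, if any
                i += (i + 1 < input.length()) ? 2 : 1;
            } else if (c == '"' || c == '\r' || c == '\n') {
                break; // these characters are excluded by ~["\\\r\n]
            } else {
                i++;
            }
        }
        // StringLiteral is UnterminatedStringLiteral followed by a closing quote.
        boolean closed = i < input.length() && input.charAt(i) == '"';
        return closed ? Kind.STRING_LITERAL : Kind.UNTERMINATED_STRING_LITERAL;
    }

    public static void main(String[] args) {
        System.out.println(classify("\"\\\""));     // input is "\"
        System.out.println(classify("\"a\\\"b\"")); // input is "a\"b"
    }
}
```

With the input from the question ("\"), the backslash swallows the second quote, so nothing is left to close the string and the sketch reports an unterminated literal instead of a complete one.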

Upvotes: 15

Bart Kiers

Reputation: 170148

Yes, "\" is matched by the STRING rule:

            STRING: '"' (ESC|.)*? '"';
                     ^       ^     ^
                     |       |     |
// matches:          "       \     "

If you don't want the . to match the backslash (and quote), do something like this:

STRING: '"' ( ESC | ~[\\"] )* '"';

And if your string can't be spread over multiple lines, do:

STRING: '"' ( ESC | ~[\\"\r\n] )* '"';
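
If you want to see the difference without generating a lexer, the same effect can be reproduced with java.util.regex. This is only an analogy: ANTLR lexers are not regex engines, and the class name NonGreedyDemo and the LOOSE and STRICT patterns are made up for this sketch. LOOSE imitates the original rule and STRICT imitates the first corrected one.

```java
import java.util.regex.Pattern;

class NonGreedyDemo {
    // Regex analogue of: '"' (ESC|.)*? '"'
    static final Pattern LOOSE =
            Pattern.compile("\"(\\\\\"|\\\\\\\\|.)*?\"", Pattern.DOTALL);

    // Regex analogue of: '"' ( ESC | ~[\\"] )* '"'
    static final Pattern STRICT =
            Pattern.compile("\"(\\\\\"|\\\\\\\\|[^\\\\\"])*\"");

    public static void main(String[] args) {
        String s = "\"\\\""; // looks like "\"

        // The dot is allowed to eat the lone backslash, so the last quote closes the string.
        System.out.println(LOOSE.matcher(s).matches());  // prints true

        // A backslash can only appear as part of an escape, so nothing closes the string.
        System.out.println(STRICT.matcher(s).matches()); // prints false
    }
}
```

A well-formed input such as "a\"b" is accepted by both patterns; only the malformed one tells them apart.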

Upvotes: 8
