Muhammad Shahbaz

Reputation: 23

Using regex to only match those Strings which use escape character correctly (according to Java syntax)?

Take these strings, for example:

"hello world\n" (correct - regex should match this)

"I'm happy \ here" (this is incorrect as the escape character is not used correctly - regex should not match this one)

I've tried searching on Google but didn't find anything helpful.

I want to use this regex in a parser that extracts only the string literals from a Java source file.

Here is the regex I used:

"\\\"(\\[tbnrf\'\"\\])*[a-zA-Z0-9\\`\\~\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)\\_\\-\\+\\=\\|\\{\\[\\}\\]\\;\\:\\'\\/\\?\\>\\.\\<\\,]\\\""

What am I doing wrong?

Upvotes: 1

Views: 738

Answers (1)

Ralf Kleberhoff

Reputation: 7290

I guess you gave us the regex in Java String literal form, like

String regex = "\\\"(\\[tbnrf\'\"\\])*[a-zA-Z0-9\\`\\~\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)\\_\\-\\+\\=\\|\\{\\[\\}\\]\\;\\:\\'\\/\\?\\>\\.\\<\\,]\\\"";

Unpacking that from Java's String escaping syntax gives the raw regex:

\"(\[tbnrf'"\])*[a-zA-Z0-9\`\~\!\@\#\$\%\^\&\*\(\)\_\-\+\=\|\{\[\}\]\;\:\'\/\?\>\.\<\,]\"

That consists of:

  • \" matching a double-quote character (Java String literal begins here). Escaping the double quotes with backslash isn't necessary: " on its own is ok as well.
  • (\[tbnrf'"\])*: a group, repeated 0...n times. I guess you want that to match against the various Java backslash escapes, but that should read (\\[tbnrf'"\\])* with a double backslash in front and inside the character class. And maybe you want to cover the Java octal escapes as well (see the language specification), giving (\\[tbnrf01234567'"\\])*
  • [a-zA-Z0-9\`\~\!\@\#\$\%\^\&\*\(\)\_\-\+\=\|\{\[\}\]\;\:\'\/\?\>\.\<\,]: a character class matching one character from a selected list of alphanumeric and punctuation characters. I'd replace that with [^"\\], meaning anything but double quote or backslash.
  • \" matching a double-quote character (string literal ends here). Once again, no need to escape the double quote.

Besides the individual elements, the overall structure of the regex probably isn't what you want: it allows only strings that begin with any number of backslash escapes, followed by exactly one non-escape character, all enclosed in a pair of double quotes.
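To see that structure in action, here is a small sketch that runs the regex from the question against a few inputs. The class name OriginalRegexDemo and the helper matches are made up for this illustration. Note that, as written, the group matches the literal text [tbnrf'"] rather than a backslash escape, which the third input shows.

```java
import java.util.regex.Pattern;

class OriginalRegexDemo {
    // The regex from the question, copied unchanged as a Java string literal.
    static final Pattern ORIGINAL = Pattern.compile(
            "\\\"(\\[tbnrf\'\"\\])*[a-zA-Z0-9\\`\\~\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)\\_\\-\\+\\=\\|\\{\\[\\}\\]\\;\\:\\'\\/\\?\\>\\.\\<\\,]\\\"");

    // Hypothetical helper: true if the whole input matches the original regex.
    static boolean matches(String source) {
        return ORIGINAL.matcher(source).matches();
    }

    public static void main(String[] args) {
        // Exactly one simple character between the quotes: prints true
        System.out.println(matches("\"a\""));
        // The "correct" example from the question: prints false
        System.out.println(matches("\"hello world\\n\""));
        // Literal bracket text followed by one character: prints true
        System.out.println(matches("\"[tbnrf'\"]a\""));
    }
}
```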

The overall structure should instead be "(backslash_escape|simple_character)*"

So, the complete regex would be:

"(\\[tbnrf01234567'"\\]|[^"\\])*"

or, expressed in a Java literal:

String regex = "\"(\\\\[tbnrf01234567'\"\\\\]|[^\"\\\\])*\"";
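As a quick check, the following sketch applies that literal to the two example strings from the question. The class and method names are assumptions chosen for the example, and matches() is used so that the whole input has to be a single string literal.

```java
import java.util.regex.Pattern;

class StringLiteralRegexDemo {
    // The corrected regex from this answer, written as a Java string literal.
    static final Pattern STRING_LITERAL =
            Pattern.compile("\"(\\\\[tbnrf01234567'\"\\\\]|[^\"\\\\])*\"");

    // Hypothetical helper: true if the whole input is one string literal
    // that uses only the escapes covered by the regex.
    static boolean isValidLiteral(String source) {
        return STRING_LITERAL.matcher(source).matches();
    }

    public static void main(String[] args) {
        // Input text is: "hello world\n" (quotes and backslash included): prints true
        System.out.println(isValidLiteral("\"hello world\\n\""));
        // Input text is: "I'm happy \ here": prints false
        System.out.println(isValidLiteral("\"I'm happy \\ here\""));
    }
}
```

To pull literals out of a larger source file, Matcher.find() could be used instead of matches(), but then quotes inside comments or character literals would need separate handling.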

Although this is shorter than your original attempt, I still wouldn't call it readable, and I'd opt for a different implementation that doesn't use regular expressions.
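One possible shape for such a non-regex implementation is a simple character scan. This is only a sketch under the same assumptions as the regex (same escape set, no Unicode escapes); StringLiteralChecker and isValidLiteral are names invented here.

```java
class StringLiteralChecker {
    // Characters accepted after a backslash; the same set the regex allows.
    private static final String ESCAPES = "tbnrf01234567'\"\\";

    // Hypothetical helper: true if s is a double-quoted literal whose
    // backslashes are all followed by a character from ESCAPES.
    static boolean isValidLiteral(String s) {
        int n = s.length();
        // The input needs at least the two enclosing quotes.
        if (n < 2 || s.charAt(0) != '"' || s.charAt(n - 1) != '"') {
            return false;
        }
        int i = 1;
        while (i < n - 1) {
            char c = s.charAt(i);
            if (c == '"') {
                return false; // unescaped quote inside the literal
            }
            if (c == '\\') {
                // The escape character must exist, must not be the closing
                // quote itself, and must be one of the known escapes.
                if (i + 1 >= n - 1 || ESCAPES.indexOf(s.charAt(i + 1)) < 0) {
                    return false;
                }
                i += 2; // skip the backslash and the escape character
            } else {
                i++;
            }
        }
        return true;
    }
}
```

A scan like this is easier to extend with specific error messages (for example, reporting the position of the bad escape) than a single regex.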

P.S. Although I did some testing with my regex, I'm not at all sure that it covers all relevant cases correctly.

P.P.S. The \uxxxx Unicode escapes are not yet covered by the regex.
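One way the regex might be extended for them is to add a third alternative, \\u+[0-9a-fA-F]{4}, since the language specification allows one or more u characters followed by four hex digits. The sketch below is an assumption, not something tested beyond the few cases shown, and it ignores the fact that Java translates Unicode escapes before tokenizing (so an escaped double quote would really end the literal).

```java
import java.util.regex.Pattern;

class UnicodeEscapeRegexDemo {
    // The regex from this answer plus one assumed extra alternative for
    // Unicode escapes: backslash, one or more 'u', then four hex digits.
    static final Pattern STRING_LITERAL = Pattern.compile(
            "\"(\\\\[tbnrf01234567'\"\\\\]|\\\\u+[0-9a-fA-F]{4}|[^\"\\\\])*\"");

    // Hypothetical helper: true if the whole input matches the extended regex.
    static boolean isValidLiteral(String source) {
        return STRING_LITERAL.matcher(source).matches();
    }

    public static void main(String[] args) {
        // Literal containing a well-formed Unicode escape: prints true
        System.out.println(isValidLiteral("\"caf\\u00e9\""));
        // Only two hex digits after the u: prints false
        System.out.println(isValidLiteral("\"\\u12\""));
    }
}
```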

Upvotes: 1
