user3830679

Java regular expression to get characters between double quotes

I need to figure out a regular expression (Pattern) to be able to get characters between double quotes.

It's a little hard to explain, but here is what I want:

If I run this through said expression:

say("ex" + "ex2", "ex3");

I will then be able to get three matches, which are:

"ex", "ex2", and "ex3"

all in their own strings.

I've already tried this expression:

Pattern.compile("\\\"(.*)\\\"");

But instead of giving me three different .group()s, I get a single match that spans everything from the first quote to the last: "ex" + "ex2", "ex3"

So does anyone know an expression to give me the output I want?

Upvotes: 1

Views: 3770

Answers (1)

willeM_ Van Onsem

Reputation: 476659

You can do this using a non-greedy approach:

"\\\"(.*?)\\\""

A non-greedy quantifier ends the group as soon as possible, in this case as soon as a second double quote is found.
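As a minimal sketch of how the non-greedy pattern can be used to collect each quoted string separately, the example below loops over `Matcher.find()`. The class name `QuotedStrings` and the helper `extract` are made up for illustration. The pattern is written as `"\"(.*?)\""`, which is equivalent to the form above because a double quote needs no escaping at the regex level.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuotedStrings {
    // Reluctant quantifier: the group stops at the first closing quote.
    private static final Pattern QUOTED = Pattern.compile("\"(.*?)\"");

    // Hypothetical helper: collects the text between each pair of double quotes.
    public static List<String> extract(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = QUOTED.matcher(input);
        while (m.find()) {          // each find() call advances to the next match
            result.add(m.group(1)); // group(1) is the text without the quotes
        }
        return result;
    }

    public static void main(String[] args) {
        // Input from the question: say("ex" + "ex2", "ex3");
        System.out.println(extract("say(\"ex\" + \"ex2\", \"ex3\");"));
        // prints [ex, ex2, ex3]
    }
}
```

Calling `find()` repeatedly is what yields three separate results; a single `matches()` or one `find()` call only ever reports one match.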

Or, for instance, match all characters apart from the quote:

"\\\"([^\"]*)\\\""

[^list] means all characters except the characters in the list.

Furthermore, you can perhaps make it more readable by omitting double escaping:

"[\"]([^\"]*)[\"]"

Note furthermore that this doesn't handle nested quotes: if the string to match is "foo "inner" bar", it will match "foo " and not the whole "foo "inner" bar", but I guess those are the semantics one is looking for.
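The behaviour on nested quotes can be checked with a short sketch. The class name `NegatedClassDemo` and the helper `extract` are illustrative only; the pattern is the negated character class variant with the group around the content.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NegatedClassDemo {
    // [^"]* can never run past a double quote, so no reluctant quantifier is needed.
    private static final Pattern QUOTED = Pattern.compile("\"([^\"]*)\"");

    // Hypothetical helper: returns the content of every quoted section.
    public static List<String> extract(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = QUOTED.matcher(input);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        // Raw input: "foo "inner" bar"
        // The quotes are paired left to right, so the captured groups are
        // "foo " and " bar"; the word inner is never captured.
        for (String s : extract("\"foo \"inner\" bar\"")) {
            System.out.println("[" + s + "]");
        }
    }
}
```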

EDIT:

If escaped double quotes are allowed as well, you can use a negative lookbehind:

"([\"].*?(?<!\\\\)[\"])"

The (?<!\\\\), which is (?<!\\) once the Java string escaping is removed, means that the character immediately before the closing quote must not be a backslash.

A problem with this approach, however, is that one can also write a string such as:

"Foo\\"

This is used to specify the string Foo\ (a real backslash), so the closing quote is preceded by a backslash without being escaped itself.

A possible solution is to check whether the quote is preceded by an odd number of consecutive backslashes, but Java does not support a lookbehind of unbounded length. The alternative is to make the inner matching loop more complicated:

"([\"]([^\\\\\"]*([\\\\].)*)*[\"])"

Unescaped this regex is:

(["]([^\\"]*([\\].)*)*["])
  ^    ^       ^       ^
  |    |       |       \- trailing double quote
  |    |       \- if backslash, skip next character (for instance `\\`, `\"` or `\n`)
  |    \- match all except double quotes and backslashes
  \-beginning double quote
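
To illustrate how this escape-aware pattern behaves, here is a small sketch that runs it over an input containing both an escaped quote and a string ending in a backslash. The class name `EscapedQuotes` and the helper `extract` are assumptions made for the example; the pattern is the one shown above, and its outer group includes the surrounding quotes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EscapedQuotes {
    // Unescaped regex: (["]([^\\"]*([\\].)*)*["])
    private static final Pattern QUOTED =
            Pattern.compile("([\"]([^\\\\\"]*([\\\\].)*)*[\"])");

    // Hypothetical helper: returns every quoted string, quotes included.
    public static List<String> extract(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = QUOTED.matcher(input);
        while (m.find()) {
            result.add(m.group(1)); // group(1) spans the whole quoted string
        }
        return result;
    }

    public static void main(String[] args) {
        // Raw input: say("a\"b", "Foo\\")
        String input = "say(\"a\\\"b\", \"Foo\\\\\")";
        for (String s : extract(input)) {
            System.out.println(s);
        }
        // prints "a\"b" and then "Foo\\", each on its own line
    }
}
```

Because a backslash always consumes the character after it, `\"` stays inside the string while the quote after `\\` correctly closes it.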

See this jdoodle; it reads a raw string from stdin and outputs the captured groups.

Upvotes: 5
