voccoeisuoi

Reputation: 337

Java regex to match double quoted substrings

I want to parse the following string:

String text = "\"w1 w\"2\" w3 | w4 w\"5 \"w6 w7\"";
// "w1 w"2" w3 | w4 w"5 "w6 w7"

I'm using Pattern.compile(regex).matcher(text), so what I'm missing here is the proper regex. The rules are that regex has to:

So the resulting matches should be:

  1. w1 w"2
  2. w3
  3. |
  4. w4
  5. w"5
  6. w6 w7

Whether the surrounding double quotes are included in the matches for the double-quoted substrings is irrelevant (e.g. 1. could be either w1 w"2 or "w1 w"2").

What I came up with is something like this:

"\"(.*)\"|(\\S+)"

I also tried many different variants of the above regex (including lookbehind/lookahead), but none of them gives me the expected result.
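
For reference, a minimal sketch of what the attempted pattern does on the sample input. The class name GreedyAttemptDemo and the helper findMatches are made up for this illustration. Because .* is greedy, the first alternative runs from the first double quote to the last one, so the whole input comes back as a single match:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative class name; not part of the original question.
class GreedyAttemptDemo {

    // Returns every full match of the attempted regex on the given text.
    static List<String> findMatches(String text) {
        Matcher matcher = Pattern.compile("\"(.*)\"|(\\S+)").matcher(text);
        List<String> matches = new ArrayList<>();
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String text = "\"w1 w\"2\" w3 | w4 w\"5 \"w6 w7\"";
        // The greedy .* spans from the first quote to the last one,
        // so the list holds one element: the entire input.
        System.out.println(findMatches(text));
    }
}
```

Making the quantifier lazy (.*?) is not enough on its own either, since the match would then stop at the quote inside w"2; the closing quote has to be told apart from an inner one by what surrounds it.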

Any idea on how to improve this?

Upvotes: 3

Views: 1993

Answers (2)

Gurmanjot Singh

Reputation: 10360

Try this Regex:

(?:(?<=^")|(?<=\s")).*?(?="(?:\s|$))|(?![\s"])\S+

EXPLANATION:

  • (?:(?<=^")|(?<=\s")) - Positive lookbehind to find a position that is preceded by a ". This " must either be at the start of the string or come right after a whitespace character
  • .*? - lazily matches 0+ occurrences of any character other than a newline character
  • (?="(?:\s|$)) - Positive lookahead to validate that whatever is matched so far is followed by a " that is itself followed by either a whitespace character or the end of the string ($)
  • | - OR (either the above match or the following)
  • (?![\s"]) - Negative lookahead to validate that the position is not followed by either a whitespace character or a "
  • \S+ - matches 1+ occurrences of a non-whitespace character

Java code:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MyClass {
    public static void main(String[] args) {
        final String regex = "(?:(?<=^\")|(?<=\\s\")).*?(?=\"(?:\\s|$))|(?![\\s\"])\\S+";
        final String string = "\"w1 w\"2\" w3 | w4 w\"5 \"w6 w7\"";

        final Pattern pattern = Pattern.compile(regex);
        final Matcher matcher = pattern.matcher(string);

        // Print every match; the pattern has no capturing groups,
        // so the inner loop over groups prints nothing here.
        while (matcher.find()) {
            System.out.println("Full match: " + matcher.group(0));
            for (int i = 1; i <= matcher.groupCount(); i++) {
                System.out.println("Group " + i + ": " + matcher.group(i));
            }
        }
    }
}

OUTPUT:

Full match: w1 w"2
Full match: w3
Full match: |
Full match: w4
Full match: w"5
Full match: w6 w7

Upvotes: 2

sp00m

Reputation: 48827

This seems to do the job:

"(?:[^"]|\b"\b)+"|\S+

Note that in Java, because we're using string literals for regexes, every backslash needs to be preceded by another backslash, and every double quote needs to be escaped with a backslash:

String regex = "\"(?:[^\"]|\\b\"\\b)+\"|\\S+";
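
As a quick check, here is a minimal sketch that runs this pattern against the sample string from the question. The class name QuotedTokenDemo and the helper findMatches are made up for the illustration. With this pattern the quoted matches keep their surrounding quotes, which the question says is acceptable:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative class name; not part of the original answer.
class QuotedTokenDemo {

    // Either a quoted run, where an inner " is allowed only between two
    // word characters (\b"\b), or a plain run of non-whitespace characters.
    private static final Pattern TOKEN =
            Pattern.compile("\"(?:[^\"]|\\b\"\\b)+\"|\\S+");

    // Collects the full matches in the order they appear in the text.
    static List<String> findMatches(String text) {
        Matcher matcher = TOKEN.matcher(text);
        List<String> matches = new ArrayList<>();
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String text = "\"w1 w\"2\" w3 | w4 w\"5 \"w6 w7\"";
        // Prints, one per line:
        // "w1 w"2"
        // w3
        // |
        // w4
        // w"5
        // "w6 w7"
        findMatches(text).forEach(System.out::println);
    }
}
```

The \b"\b alternative is what lets the quote inside w"2 stay part of the quoted substring: a " counts as an inner quote only when it has a word character on both sides, so a closing quote followed by a space or by the end of the string still ends the match.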

Upvotes: 1
