Reputation: 337
I want to parse the following string:
String text = "\"w1 w\"2\" w3 | w4 w\"5 \"w6 w7\"";
// "w1 w"2" w3 | w4 w"5 "w6 w7"
I'm using Pattern.compile(regex).matcher(text), so what I'm missing here is the proper regex.
The rules are that regex has to:
So the resulting matches should be:
Whether the surrounding double quotes are included in the quoted substrings is irrelevant (e.g. match 1 could be either w1 w"2 or "w1 w"2").
What I came up with is something like this:
"\"(.*)\"|(\\S+)"
I also tried many different variants of the above regex (including lookbehind/lookahead), but none of them gives me the expected result.
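To illustrate what seems to go wrong with the attempt above, here is a minimal sketch. The class name GreedyQuoteDemo and the helper findAll are made up for this illustration. Because .* is greedy, the first alternative appears to run from the very first double quote to the very last one, so the whole input comes back as a single match:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original attempt.
class GreedyQuoteDemo {
    // Collects every full match of the given regex in the given text.
    static List<String> findAll(String regex, String text) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String text = "\"w1 w\"2\" w3 | w4 w\"5 \"w6 w7\"";
        List<String> matches = findAll("\"(.*)\"|(\\S+)", text);
        // The greedy .* spans from the first quote to the last quote.
        System.out.println(matches.size());              // prints 1
        System.out.println(matches.get(0).equals(text)); // prints true
    }
}
```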
Any idea on how to improve this?
Upvotes: 3
Views: 1993
Reputation: 10360
Try this Regex:
(?:(?<=^")|(?<=\s")).*?(?="(?:\s|$))|(?![\s"])\S+
EXPLANATION:
(?:(?<=^")|(?<=\s")) - Positive lookbehind to find a position that is preceded by a ". This " needs to be either at the start of the string or right after a whitespace character.
.*? - lazily matches 0+ occurrences of any character other than a newline character
(?="(?:\s|$)) - Positive lookahead to validate that whatever has been matched so far is followed by a " which is itself followed by either a whitespace character or the end of the string ($)
| - OR (either the above match or the following)
(?![\s"]) - Negative lookahead to validate that the position is not followed by either a whitespace character or a "
\S+ - matches 1+ occurrences of a non-whitespace character
Java code:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MyClass {
    public static void main(String[] args) {
        final String regex = "(?:(?<=^\")|(?<=\\s\")).*?(?=\"(?:\\s|$))|(?![\\s\"])\\S+";
        final String string = "\"w1 w\"2\" w3 | w4 w\"5 \"w6 w7\"";
        final Pattern pattern = Pattern.compile(regex);
        final Matcher matcher = pattern.matcher(string);
        while (matcher.find()) {
            System.out.println("Full match: " + matcher.group(0));
            for (int i = 1; i <= matcher.groupCount(); i++) {
                System.out.println("Group " + i + ": " + matcher.group(i));
            }
        }
    }
}
OUTPUT:
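As a rough check of what this regex matches on the sample input, the sketch below collects the full matches into a list instead of printing them one by one. The class name LookaroundTokenDemo and the method tokens are hypothetical names chosen for this example. Since the regex contains no capturing groups, only the full match is of interest:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class wrapping the lookaround-based regex from this answer.
class LookaroundTokenDemo {
    private static final Pattern TOKEN = Pattern.compile(
            "(?:(?<=^\")|(?<=\\s\")).*?(?=\"(?:\\s|$))|(?![\\s\"])\\S+");

    // Returns every full match found in the text, in order of appearance.
    static List<String> tokens(String text) {
        List<String> result = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(text);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "\"w1 w\"2\" w3 | w4 w\"5 \"w6 w7\"";
        // prints [w1 w"2, w3, |, w4, w"5, w6 w7]
        System.out.println(tokens(text));
    }
}
```

With this regex the surrounding double quotes are left out of the quoted matches, because they are only checked by the lookbehind and lookahead and never consumed.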
Upvotes: 2
Reputation: 48827
This seems to do the job:
"(?:[^"]|\b"\b)+"|\S+
Note that in Java, because we're using string literals for regexes, a backslash needs to be preceded by another backslash:
String regex = "\"(?:[^\"]|\\b\"\\b)+\"|\\S+";
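For illustration, here is a small sketch of this regex applied to the string from the question. The class name WordBoundaryQuoteDemo and the method tokens are invented for this example. The idea, as far as I can tell, is that \b"\b only accepts a double quote that sits between two word characters (as in w"2), so such a quote is treated as part of the token, while a quote followed by whitespace or the end of the string closes it:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class wrapping the word-boundary regex from this answer.
class WordBoundaryQuoteDemo {
    private static final Pattern TOKEN =
            Pattern.compile("\"(?:[^\"]|\\b\"\\b)+\"|\\S+");

    // Returns every full match found in the text, in order of appearance.
    static List<String> tokens(String text) {
        List<String> result = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(text);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "\"w1 w\"2\" w3 | w4 w\"5 \"w6 w7\"";
        // prints ["w1 w"2", w3, |, w4, w"5, "w6 w7"]
        System.out.println(tokens(text));
    }
}
```

Here the surrounding double quotes are part of the quoted matches, which the question states is acceptable.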
Upvotes: 1