Reputation: 1918
I am working on a Java project that needs to split text containing quoted strings, which may themselves contain escaped quotes.
For an input string that in plain text looks like this:
This is "a string" and this is "a \"nested\" string"
The result must be the following:
[0] This
[1] is
[2] "a string"
[3] and
[4] this
[5] is
[6] "a \"nested\" string"
Note that I want the \" sequences to be kept.
I have the following method:
public static String[] splitKeepingQuotationMarks(String s);
and I need to build an array of strings from the given s parameter according to the rules above, without using the Java Collections Framework or its derivatives.
I am unsure how to solve this problem.
Can a regular expression solve it?
UPDATE based on questions from comments:
Every unescaped " has its closing unescaped " (they are balanced).
\ must also be escaped if we want to create a literal representing it (to create text representing \ we need to write it as \\).
Upvotes: 10
Views: 1967
Reputation: 626802
You can use the following regex:
"[^"\\]*(?:\\.[^"\\]*)*"|\S+
See the regex demo
String str = "This is \"a string\" and this is \"a \\\"nested\\\" string\"";
Pattern ptrn = Pattern.compile("\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"|\\S+");
Matcher matcher = ptrn.matcher(str);
while (matcher.find()) {
    System.out.println(matcher.group(0));
}
Explanation:
"[^"\\]*(?:\\.[^"\\]*)*" - a double quote followed by 0+ characters other than " and \ ([^"\\]*), followed by 0+ repetitions of an escape sequence (\\.) followed by 0+ characters other than " and \, and finally the closing double quote
| - or...
\S+ - 1 or more non-whitespace characters
NOTE
@Pshemo's suggestion, "\"(?:\\\\.|[^\"])*\"|\\S+" (or, more correctly, "\"(?:\\\\.|[^\"\\\\])*\"|\\S+"), is essentially the same expression, but it is much less efficient because it uses an alternation group quantified with *. This construct involves much more backtracking, since the regex engine has to try two alternatives at each position. My version, based on the unroll-the-loop technique, matches chunks of text at once and is thus faster and more reliable.
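To make the comparison concrete, here is a minimal sketch that runs both patterns over the sample input and checks that they produce the same tokens. The class name PatternComparison and the helper tokens are made up for illustration, and the sketch only checks equivalence on this one input; it does not measure performance or backtracking.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PatternComparison {
    // Unrolled-loop version from the answer above.
    static final Pattern UNROLLED =
            Pattern.compile("\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"|\\S+");
    // Alternation-based version (the "more correct" variant of the suggestion).
    static final Pattern ALTERNATION =
            Pattern.compile("\"(?:\\\\.|[^\"\\\\])*\"|\\S+");

    // Hypothetical helper: collects all matches into an array using two passes.
    static String[] tokens(Pattern p, String s) {
        Matcher m = p.matcher(s);
        int n = 0;
        while (m.find()) {
            n++;
        }
        String[] out = new String[n];
        m.reset();
        for (int i = 0; m.find(); i++) {
            out[i] = m.group();
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "This is \"a string\" and this is \"a \\\"nested\\\" string\"";
        String[] a = tokens(UNROLLED, input);
        String[] b = tokens(ALTERNATION, input);
        System.out.println(Arrays.toString(a));
        System.out.println(Arrays.equals(a, b)); // prints true for this input
    }
}
```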
UPDATE
Since a String[] is required as output, you need two passes over the input: first count the matches, then create the array and re-run the matcher to fill it:
// Needs java.util.Arrays, java.util.regex.Matcher and java.util.regex.Pattern.
int cnt = 0;
String str = "This is \"a string\" and this is \"a \\\"nested\\\" string\"";
Pattern ptrn = Pattern.compile("\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"|\\S+");
Matcher matcher = ptrn.matcher(str);
// First pass: count the matches so the array can be sized exactly.
while (matcher.find()) {
    cnt++;
}
System.out.println(cnt);
String[] result = new String[cnt];
// Second pass: rewind the matcher and store each match.
matcher.reset();
int idx = 0;
while (matcher.find()) {
    result[idx] = matcher.group(0);
    idx++;
}
System.out.println(Arrays.toString(result));
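For completeness, the count-then-fill idea can be wrapped in the method signature from the question. This is only a sketch: the class name RegexQuoteSplitter and the constant TOKEN are assumptions made for the example, and java.util.Arrays is used solely to print the result in main.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexQuoteSplitter {
    // A quoted token with backslash escapes (unrolled loop), or a run of non-whitespace.
    private static final Pattern TOKEN =
            Pattern.compile("\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"|\\S+");

    public static String[] splitKeepingQuotationMarks(String s) {
        Matcher matcher = TOKEN.matcher(s);
        // First pass: count the matches.
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        // Second pass: rewind and copy each match into an exactly sized array.
        String[] result = new String[count];
        matcher.reset();
        for (int i = 0; matcher.find(); i++) {
            result[i] = matcher.group();
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "This is \"a string\" and this is \"a \\\"nested\\\" string\"";
        System.out.println(Arrays.toString(splitKeepingQuotationMarks(input)));
        // prints [This, is, "a string", and, this, is, "a \"nested\" string"]
    }
}
```

Compiling the pattern once into a static field avoids recompiling it on every call; the price of not using a list is that the input is scanned twice.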
Upvotes: 10
Reputation: 1351
An alternative method that does not use a regex:
import java.util.ArrayList;
import java.util.Arrays;

public class SplitKeepingQuotationMarks {
    public static void main(String[] args) {
        String pattern = "This is \"a string\" and this is \"a \\\"nested\\\" string\"";
        System.out.println(Arrays.toString(splitKeepingQuotationMarks(pattern)));
    }

    public static String[] splitKeepingQuotationMarks(String s) {
        ArrayList<String> results = new ArrayList<>();
        StringBuilder last = new StringBuilder();
        boolean inString = false;
        boolean wasBackSlash = false; // true only for the character right after a backslash
        for (char c : s.toCharArray()) {
            if (Character.isWhitespace(c) && !inString) {
                // Whitespace outside quotes ends the current token, if any.
                if (last.length() > 0) {
                    results.add(last.toString());
                    last.setLength(0); // Clears the StringBuilder.
                }
                wasBackSlash = false;
                continue;
            }
            last.append(c);
            if (wasBackSlash) {
                // This character was escaped, so it has no special meaning.
                wasBackSlash = false;
            } else if (c == '"') {
                inString = !inString;
            } else if (c == '\\') {
                wasBackSlash = true;
            }
        }
        // Add the final token only if there is one (avoids a trailing empty string).
        if (last.length() > 0) {
            results.add(last.toString());
        }
        return results.toArray(new String[0]);
    }
}
Output:
[This, is, "a string", and, this, is, "a \"nested\" string"]
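Since the question rules out the Collections Framework, the same character-by-character scan can also be done with plain arrays. The following is a minimal sketch under that assumption; the class name ManualQuoteSplitter and the helper scan are invented for illustration. It walks the input twice, once to count tokens and once to fill the array, and it clears the escape flag after every escaped character.

```java
public class ManualQuoteSplitter {

    public static String[] splitKeepingQuotationMarks(String s) {
        // Pass 1 counts the tokens, pass 2 fills an exactly sized array.
        String[] result = new String[scan(s, null)];
        scan(s, result);
        return result;
    }

    // Walks the input once and returns the number of tokens found.
    // When out is non-null, each token is also stored in it.
    private static int scan(String s, String[] out) {
        int count = 0;
        int start = -1;          // start index of the current token, -1 if none
        boolean inQuotes = false;
        boolean escaped = false; // true only for the character right after a backslash
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (!inQuotes && Character.isWhitespace(c)) {
                // Whitespace outside quotes closes the current token, if any.
                if (start >= 0) {
                    if (out != null) {
                        out[count] = s.substring(start, i);
                    }
                    count++;
                    start = -1;
                }
                escaped = false;
                continue;
            }
            if (start < 0) {
                start = i;
            }
            if (escaped) {
                escaped = false;      // escaped character: no special meaning
            } else if (c == '\\') {
                escaped = true;
            } else if (c == '"') {
                inQuotes = !inQuotes; // unescaped quote opens or closes a quoted part
            }
        }
        if (start >= 0) {
            // The last token runs to the end of the input.
            if (out != null) {
                out[count] = s.substring(start);
            }
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String input = "This is \"a string\" and this is \"a \\\"nested\\\" string\"";
        for (String token : splitKeepingQuotationMarks(input)) {
            System.out.println(token);
        }
    }
}
```

One design difference worth noting: a quote that appears in the middle of a word keeps that word and the quoted part together as a single token, which is not how the regex-based answers tokenize such input.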
Upvotes: 2
Reputation: 7361
Another regex approach uses a negative lookbehind: match either a "word" (\w+) or "a quote followed by anything up to the next quote that isn't preceded by a backslash", and apply the match globally (don't stop at the first match):
(\w+|".*?(?<!\\)")
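In Java, "global" matching corresponds to calling Matcher.find() in a loop. Below is a minimal sketch of this pattern written as a Java string literal; the class name LookbehindSplitter is an assumption for the example. Two caveats apply: \w+ only matches letters, digits and underscores, so an unquoted token such as a-b is split into a and b with the hyphen dropped, and the lookbehind treats any quote preceded by a backslash as escaped, even when that backslash is itself escaped (as in a quoted string ending in \\).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookbehindSplitter {
    // \w+  or  a quote, lazily anything, then a quote not preceded by a backslash.
    private static final Pattern TOKEN =
            Pattern.compile("\\w+|\".*?(?<!\\\\)\"");

    public static String[] splitKeepingQuotationMarks(String s) {
        Matcher matcher = TOKEN.matcher(s);
        // Count first, then fill, so no collection class is needed.
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        String[] result = new String[count];
        matcher.reset();
        for (int i = 0; matcher.find(); i++) {
            result[i] = matcher.group();
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "This is \"a string\" and this is \"a \\\"nested\\\" string\"";
        for (String token : splitKeepingQuotationMarks(input)) {
            System.out.println(token);
        }
    }
}
```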
Upvotes: 7