bobasti

Reputation: 1918

Splitting a nested string keeping quotation marks

I am working on a project in Java that requires having nested strings.

For an input string that in plain text looks like this:

This is "a string" and this is "a \"nested\" string"

The result must be the following:

[0] This
[1] is
[2] "a string"
[3] and
[4] this
[5] is
[6] "a \"nested\" string"

Note that I want the \" sequences to be kept.
I have the following method:

public static String[] splitKeepingQuotationMarks(String s);

I need to build an array of strings from the given s parameter according to the rules above, without using the Java Collections Framework or its derivatives.

I am unsure about how to solve this problem.
Can this be solved with a regular expression?

UPDATE based on questions from comments:

Upvotes: 10

Views: 1967

Answers (3)

Wiktor Stribiżew

Reputation: 626802

You can use the following regex:

"[^"\\]*(?:\\.[^"\\]*)*"|\S+

See the regex demo

Java demo:

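import java.util.regex.Matcher;
import java.util.regex.Pattern;

String str = "This is \"a string\" and this is \"a \\\"nested\\\" string\"";
Pattern ptrn = Pattern.compile("\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"|\\S+");
Matcher matcher = ptrn.matcher(str);
// Each find() call advances to the next token; group(0) is the whole match
while (matcher.find()) {
    System.out.println(matcher.group(0));
}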

Explanation:

  • "[^"\\]*(?:\\.[^"\\]*)*" - a double quote, followed by 0+ characters other than " and \ ([^"\\]*), followed by 0+ repetitions of an escape sequence (\\., a backslash plus any character) and 0+ characters other than " and \, followed by the closing double quote
  • | - or...
  • \S+ - 1 or more non-whitespace characters

NOTE

@Pshemo's suggestion, "\"(?:\\\\.|[^\"])*\"|\\S+" (or, more correctly, "\"(?:\\\\.|[^\"\\\\])*\"|\\S+"), expresses the same idea but is much less efficient because it quantifies an alternation group with *. That construct involves much more backtracking overhead: the regex engine has to test the group at each position, and there are 2 alternatives to try at each position. My unroll-the-loop version matches chunks of text at once, and is thus much faster and more reliable.
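To make the comparison concrete, here is a minimal sketch that runs both patterns over the question's sample input and joins the matches so they can be compared. The class name PatternComparison and the helper joinMatches are made up for this illustration. It only checks that the two patterns produce the same tokens on this input; it does not measure speed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PatternComparison {
    // Unrolled-loop pattern from this answer.
    static final String UNROLLED = "\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"|\\S+";
    // Alternation-based variant (the "more correct" form quoted above).
    static final String ALTERNATION = "\"(?:\\\\.|[^\"\\\\])*\"|\\S+";

    // Hypothetical helper: joins all matches with '|' so results are easy to compare.
    static String joinMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        StringBuilder sb = new StringBuilder();
        while (m.find()) {
            if (sb.length() > 0) {
                sb.append('|');
            }
            sb.append(m.group());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String input = "This is \"a string\" and this is \"a \\\"nested\\\" string\"";
        System.out.println(joinMatches(UNROLLED, input));
        System.out.println(joinMatches(ALTERNATION, input));
    }
}
```

Both lines printed by main should be identical for this input; the difference between the patterns is in how much work the engine does, not in what they match here.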

UPDATE

Since a String[] is required as output, you need two passes over the matches: count them, create an array of that size, and then reset the matcher and run it again to fill the array:

import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

int cnt = 0;
String str = "This is \"a string\" and this is \"a \\\"nested\\\" string\"";
Pattern ptrn = Pattern.compile("\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"|\\S+");
Matcher matcher = ptrn.matcher(str);
// First pass: count the matches to size the array
while (matcher.find()) {
    cnt++;
}
System.out.println(cnt);
String[] result = new String[cnt];
// Second pass: rewind the matcher and store each match
matcher.reset();
int idx = 0;
while (matcher.find()) {
    result[idx] = matcher.group(0);
    idx++;
}
System.out.println(Arrays.toString(result));

See another IDEONE demo
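For completeness, the same two-pass idea can be wrapped in the method signature from the question. This is only a sketch: the class name RegexSplitter and the constant TOKEN are my own choices, and it uses java.util.regex only, which is not part of the Collections Framework.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexSplitter {
    // Same pattern as above, compiled once and reused.
    private static final Pattern TOKEN =
            Pattern.compile("\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"|\\S+");

    public static String[] splitKeepingQuotationMarks(String s) {
        Matcher matcher = TOKEN.matcher(s);

        // Pass 1: count the tokens so the array can be sized exactly.
        int count = 0;
        while (matcher.find()) {
            count++;
        }

        // Pass 2: rewind and copy each token into the array.
        String[] result = new String[count];
        matcher.reset();
        for (int i = 0; matcher.find(); i++) {
            result[i] = matcher.group();
        }
        return result;
    }

    public static void main(String[] args) {
        String str = "This is \"a string\" and this is \"a \\\"nested\\\" string\"";
        for (String token : splitKeepingQuotationMarks(str)) {
            System.out.println(token);
        }
    }
}
```

An empty input simply yields an empty array, because the first pass finds no matches.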

Upvotes: 10

Majora320

Reputation: 1351

An alternative method that does not use a regex:

import java.util.ArrayList;
import java.util.Arrays;

public class SplitKeepingQuotationMarks {
    public static void main(String[] args) {
        String pattern = "This is \"a string\" and this is \"a \\\"nested\\\" string\"";
        System.out.println(Arrays.toString(splitKeepingQuotationMarks(pattern)));
    }

    public static String[] splitKeepingQuotationMarks(String s) {
        ArrayList<String> results = new ArrayList<>();
        StringBuilder last = new StringBuilder();
        boolean inString = false;
        boolean wasBackSlash = false;

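        for (char c : s.toCharArray()) {
            if (Character.isWhitespace(c) && !inString) {
                // Whitespace outside a quoted string ends the current token
                if (last.length() > 0) {
                    results.add(last.toString());
                    last.setLength(0); // Clears the s.b.
                }
            } else {
                last.append(c);
                // Only an unescaped quote opens or closes a quoted string
                if (c == '"' && !wasBackSlash)
                    inString = !inString;
            }
            // Reset the flag after every character; an escaped backslash escapes nothing
            wasBackSlash = (c == '\\') && !wasBackSlash;
        }

        // Avoid adding an empty token when the input ends with whitespace
        if (last.length() > 0)
            results.add(last.toString());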
        return results.toArray(new String[results.size()]);
    }
}

Output:

[This, is, "a string", and, this, is, "a \"nested\" string"]
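Since ArrayList belongs to the Collections Framework, which the question rules out, here is a rough sketch of the same character-scanning idea using only a plain array. It scans the input twice: once to count tokens, once to fill the array. The names QuoteAwareSplitter, split and scan are hypothetical, and I have not checked that it agrees with the regex answers on every input (for example, a quote in the middle of a word is treated differently).

```java
final class QuoteAwareSplitter {
    static String[] split(String s) {
        // First scan counts tokens, second scan stores them.
        String[] out = new String[scan(s, null)];
        scan(s, out);
        return out;
    }

    // Scans s; stores tokens into out when out is non-null; returns the token count.
    private static int scan(String s, String[] out) {
        int count = 0;
        int start = -1;            // start index of the current token, -1 between tokens
        boolean inString = false;  // true while inside an unclosed quoted string
        boolean escaped = false;   // true when the previous char was an unescaped backslash

        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (!inString && Character.isWhitespace(c)) {
                if (start >= 0) {
                    if (out != null) {
                        out[count] = s.substring(start, i);
                    }
                    count++;
                    start = -1;
                }
                escaped = false;
                continue;
            }
            if (start < 0) {
                start = i;
            }
            if (c == '"' && !escaped) {
                inString = !inString;
            }
            escaped = (c == '\\') && !escaped;
        }

        // Flush the last token if the input does not end with whitespace.
        if (start >= 0) {
            if (out != null) {
                out[count] = s.substring(start);
            }
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String input = "This is \"a string\" and this is \"a \\\"nested\\\" string\"";
        for (String token : split(input)) {
            System.out.println(token);
        }
    }
}
```

Scanning twice trades a little extra work for not needing a growable container; tokens are taken as substrings of the input, so no StringBuilder is needed either.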

Upvotes: 2

Scott Weaver

Reputation: 7361

Another regex approach uses a negative lookbehind: match either "words" (\w+) OR "a quote followed by anything up to the next quote that ISN'T preceded by a backslash", and set your match to "global" (don't stop at the first match).

(\w+|".*?(?<!\\)")

see it here.
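In Java, "global" matching corresponds to calling find() in a loop. Below is a small sketch of that; the class name LookbehindDemo and the helper joinMatches are invented for the example. Two things to keep in mind with this pattern: \w+ only matches word characters, so punctuation outside quotes is dropped, and the lookbehind looks at a single character, so a quoted string that ends in an escaped backslash (\\") is not recognised as closed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LookbehindDemo {
    // The lookbehind-based pattern from this answer, escaped for a Java string literal.
    static final Pattern LOOKBEHIND = Pattern.compile("(\\w+|\".*?(?<!\\\\)\")");

    // Hypothetical helper: joins all matches with '|' for easy inspection.
    static String joinMatches(String input) {
        Matcher m = LOOKBEHIND.matcher(input);
        StringBuilder sb = new StringBuilder();
        while (m.find()) {          // looping over find() is the "global" match
            if (sb.length() > 0) {
                sb.append('|');
            }
            sb.append(m.group());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The sample from the question is split as requested.
        System.out.println(joinMatches(
                "This is \"a string\" and this is \"a \\\"nested\\\" string\""));
        // A quoted string ending in an escaped backslash: the closing quote is
        // preceded by a backslash, so the quoted token is not matched as a whole.
        System.out.println(joinMatches("\"a\\\\\" b"));
    }
}
```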

Upvotes: 7
