Sam Stern

Reputation: 25134

Regular expression to select all whitespace that isn't in quotes?

I'm not very good at regex. Can someone give me a regex (to use in Java) that selects all whitespace that isn't between two quotes? I am trying to remove all such whitespace from a string, so any solution that does so will work.

For example:

(this is a test "sentence for the regex")

should become

(thisisatest"sentence for the regex")

Upvotes: 41

Views: 23467

Answers (6)

Andrew Wei

Reputation: 2080

This isn't an exact solution, but you can accomplish your goal by doing the following:

STEP 1: Match the two segments

\(([a-zA-Z ]*)"([a-zA-Z ]*)"\)

STEP 2: remove spaces

temp = $1 replace " " with ""

STEP 3: rebuild your string

(temp"$2")
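As a rough Java sketch of those three steps (the class and method names here are made up for illustration, and it only handles the exact shape the pattern expects: letters and spaces, a single quoted section, all wrapped in parentheses):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TwoSegmentDemo {
    // STEP 1 pattern: group 1 is the text before the quotes, group 2 the quoted text.
    private static final Pattern SEGMENTS =
            Pattern.compile("\\(([a-zA-Z ]*)\"([a-zA-Z ]*)\"\\)");

    // Hypothetical helper: returns the input unchanged if it does not fit the pattern.
    static String stripFirstSegment(String input) {
        Matcher m = SEGMENTS.matcher(input);
        if (!m.matches()) {
            return input;
        }
        // STEP 2: remove spaces from the unquoted segment only.
        String temp = m.group(1).replace(" ", "");
        // STEP 3: rebuild the string around the untouched quoted segment.
        return "(" + temp + "\"" + m.group(2) + "\")";
    }

    public static void main(String[] args) {
        System.out.println(stripFirstSegment("(this is a test \"sentence for the regex\")"));
    }
}
```

Anything after the closing quote, or more than one quoted section, would need a more general approach.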

Upvotes: 0

Siva Kranthi Kumar

Reputation: 1348

Here is the regex which works for both single & double quotes (assuming that all strings are delimited properly)

\s+(?=(?:[^\'"]*[\'"][^\'"]*[\'"])*[^\'"]*$)

It won't work with strings that have quotes inside them.


Upvotes: 20

Bart Kiers

Reputation: 170148

Here's a single regex-replace that works:

\s+(?=([^"]*"[^"]*")*[^"]*$)

which will replace:

(this is a test "sentence for the regex" foo bar)

with:

(thisisatest"sentence for the regex"foobar)
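In Java the regex has to be escaped as a string literal. A minimal sketch, assuming quotes are never escaped and always balanced (the class, constant, and method names are only for illustration):

```java
final class LookaheadDemo {
    // Whitespace followed by an even number of double quotes up to the end of input,
    // i.e. whitespace that is not inside a quoted section.
    static final String WHITESPACE_OUTSIDE_QUOTES = "\\s+(?=([^\"]*\"[^\"]*\")*[^\"]*$)";

    static String strip(String text) {
        return text.replaceAll(WHITESPACE_OUTSIDE_QUOTES, "");
    }

    public static void main(String[] args) {
        System.out.println(strip("(this is a test \"sentence for the regex\" foo bar)"));
    }
}
```

The lookahead rescans the rest of the input for every run of whitespace, which is where the quadratic behaviour comes from.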

Note that if the quotes can be escaped, this even more verbose regex will do the trick:

\s+(?=((\\[\\"]|[^\\"])*"(\\[\\"]|[^\\"])*")*(\\[\\"]|[^\\"])*$)

which replaces the input:

(this is a test "sentence \"for the regex" foo bar)

with:

(thisisatest"sentence \"for the regex"foobar)

(note that it also works with escaped backslashes: (thisisatest"sentence \\\"for the regex"foobar))

Needless to say (?), this really shouldn't be used to perform such a task: it makes one's eyes bleed, and it performs its task in quadratic time, while a simple linear solution exists.

EDIT

A quick demo:

String text = "(this is a test \"sentence \\\"for the regex\" foo bar)";
String regex = "\\s+(?=((\\\\[\\\\\"]|[^\\\\\"])*\"(\\\\[\\\\\"]|[^\\\\\"])*\")*(\\\\[\\\\\"]|[^\\\\\"])*$)";
System.out.println(text.replaceAll(regex, ""));

// output: (thisisatest"sentence \"for the regex"foobar)

Upvotes: 62

anomal

Reputation: 2299

If there is only one set of quotes, you can do this:

    String s = "(this is a test \"sentence for the regex\") a b c";

    Matcher matcher = Pattern.compile("^[^\"]+|[^\"]+$").matcher(s);
    while (matcher.find())
    {
        String group = matcher.group();
        s = s.replace(group, group.replaceAll("\\s", ""));
    }

    System.out.println(s); // (thisisatest"sentence for the regex")abc

Upvotes: 1

Edmund

Reputation: 10809

Groups of whitespace outside of quotes are separated by stuff that's a) not whitespace, or b) inside quotes.

Perhaps something like:

(\s+)([^ "]+|"[^"]*")*

The first part matches a sequence of spaces; the second part matches non-spaces (and non-quotes), or some stuff in quotes, either repeated any number of times. The second part is the separator.

This will give you two groups for each item in the result; just ignore the second element. (We need the parentheses for precedence rather than match grouping there.) Or, you could say, concatenate all the second elements -- though you need to match the first non-space word too, or in this example, make the spaces optional:

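import java.util.regex.Matcher;
import java.util.regex.Pattern;

StringBuilder b = new StringBuilder();
// Group 2 wraps the whole repetition, so it holds every separator token of the
// match rather than only the last one. [^\s"] also excludes tabs and newlines.
Pattern p = Pattern.compile("(\\s+)?((?:[^\\s\"]+|\"[^\"]*\")*)");
Matcher m = p.matcher("this is \"a test\"");
while (m.find()) {
    b.append(m.group(2)); // keep the separator, drop the whitespace in group 1
}
System.out.println(b.toString()); // thisis"a test"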

(I haven't done much regex in Java so expect bugs.)

Finally, this is how I'd do it if regexes were compulsory. ;-)

As well as Xavier's technique, you could simply do it the way you'd do it in C: just iterate over the input characters, and copy each to the new string if either it's non-space, or you've counted an odd number of quotes up to that point.
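A minimal Java sketch of that loop, assuming quotes are never escaped (the class and method names are hypothetical). Tracking "inside quotes" with a boolean is the same thing as counting whether an odd number of quotes has been seen so far:

```java
final class QuoteAwareStripper {
    // Copies every character that is non-whitespace or sits inside a quoted section.
    static String stripOutsideQuotes(String input) {
        StringBuilder out = new StringBuilder(input.length());
        boolean insideQuotes = false; // true after an odd number of quotes
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '"') {
                insideQuotes = !insideQuotes;
            }
            if (insideQuotes || !Character.isWhitespace(c)) {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripOutsideQuotes("(this is a test \"sentence for the regex\" foo bar)"));
    }
}
```

This makes a single pass over the input, so it runs in linear time.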

Upvotes: 1

Xavier Holt

Reputation: 14619

This just isn't something regexes are good at. Search-and-replace functions with regexes are always a bit limited, and any sort of nesting/containment at all becomes difficult and/or impossible.

I'd suggest an alternate approach: Split your string on quote characters. Go through the resulting array of strings, and strip the spaces from every other substring (whether you start with the first or second depends on whether your string started with a quote or not). Then join them back together, using quotes as separators. That should produce the results you're looking for.

Hope that helps!

PS: Note that this won't handle nested strings, but since you can't make nested strings with the ASCII double-quote character, I'm gonna assume you don't need that behaviour.

PPS: Once you're dealing with your substrings, then it's a good time to use regexes to kill those spaces - no containing quotes to worry about. Just remember to use the /.../g modifier to make sure it's a global replacement and not just the first match (in Java, String.replaceAll already replaces every match).
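For what it's worth, here is a rough Java sketch of the split-and-join idea, assuming unescaped, balanced double quotes (the class and method names are invented for illustration). With a limit of -1, String.split keeps leading and trailing empty strings, so the even-indexed pieces are always the ones outside quotes, even when the input starts with a quote:

```java
final class SplitOnQuotes {
    static String stripOutsideQuotes(String input) {
        // Limit -1 keeps trailing empty strings, so the join reproduces every quote.
        String[] parts = input.split("\"", -1);
        // Even indexes are outside quotes; odd indexes are the quoted sections.
        for (int i = 0; i < parts.length; i += 2) {
            parts[i] = parts[i].replaceAll("\\s+", "");
        }
        return String.join("\"", parts);
    }

    public static void main(String[] args) {
        System.out.println(stripOutsideQuotes("(this is a test \"sentence for the regex\" foo bar)"));
    }
}
```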

Upvotes: 2
