seano

Reputation: 75

Regex to remove all whitespace except between brackets

I've been wrestling with an issue I was hoping to solve with regex.

Let's say I have a string of alphanumeric characters in which some substrings may be surrounded by square brackets. These bracketed substrings can appear anywhere in the string, and there can be any number of them.

Examples:

You can see that some of the bracketed substrings contain whitespace; that's fine. My main issue right now is when I encounter spaces outside of the brackets, like this:

Now I want to preserve the spaces inside the brackets but remove them everywhere else.

This gets a little more tricky for strings like:

Here I would want the return to be:

I have spent some time reading through various regex pages on lookarounds, negative assertions, etc., and it's making my head spin.

NOTE: for anyone visiting this, I was not looking for a solution involving nested brackets. If that were the case, I'd probably do it programmatically, as some of the comments below suggest.

Upvotes: 6

Views: 5904

Answers (6)

zx81

Reputation: 41838

Resurrecting this question because it had a simple solution that wasn't mentioned.

\[[^]]*\](*SKIP)(*F)|\s+

The left side of the alternation matches a complete bracketed group and then deliberately fails, so the engine skips past it. The right side matches whitespace (there is no capture group here), and we know it is the right whitespace because anything inside brackets has already been skipped by the expression on the left.

See the matches in this demo

This means you can just do

$replace = preg_replace("~\[[^]]*\](*SKIP)(*F)|\s+~","",$string);

Reference

  1. How to match pattern except in situations s1, s2, s3
  2. How to match a pattern unless...

Upvotes: 2

Senseful

Reputation: 91671

This regex should do the trick:

[ ](?=[^\]]*?(?:\[|$))

Just replace the space that was matched with "".

Basically, the lookahead checks that, reading forward from the space, a "[" (or the end of the string) is reached before any "]". A space inside brackets fails that check because a "]" comes first.

That should work as long as you don't have nested square brackets, e.g.:

a a[b [c c]b]

Because in that case, the space after the first "b" will be removed and it will become:

aa[b[c c]b]
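A minimal sketch of the lookahead approach, written in Python rather than the PHP used elsewhere in this thread. The function name `strip_with_lookahead` and the first sample string are assumptions made for illustration; the pattern is the one given above, and the second sample is the nested case from this answer.

```python
import re

# The pattern from this answer: a space whose lookahead reaches "[" or the
# end of the string before it reaches any "]".
OUTSIDE_SPACE = re.compile(r"[ ](?=[^\]]*?(?:\[|$))")


def strip_with_lookahead(text):
    """Remove spaces that are not inside a (non-nested) bracket pair."""
    return OUTSIDE_SPACE.sub("", text)


if __name__ == "__main__":
    # Hypothetical sample input: spaces inside the brackets survive.
    print(strip_with_lookahead("ab c[d e]f g"))   # abc[d e]fg
    # Nested case from the answer: the space after the first "b" is lost.
    print(strip_with_lookahead("a a[b [c c]b]"))  # aa[b[c c]b]
```

Note that the pattern only targets the space character; tabs and newlines are left alone unless `[ ]` is swapped for `\s`.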

Upvotes: 14

Steve Wortham

Reputation: 22220

This works for me:

(\[.+?\])|\s

Then you simply pass in a replacement value of $1 when you call the replace function. The idea is to look for the patterns inside the brackets first and make sure they're untouched. And then every space outside the brackets gets replaced with nothing.
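Here is a rough Python sketch of the same idea (Python is used only for illustration, and `strip_with_capture` is a made-up name). A callback stands in for the `$1` replacement so that an unmatched group is explicitly turned into an empty string.

```python
import re

# Either a whole bracketed chunk (captured in group 1) or a single
# whitespace character (nothing captured).
BRACKET_OR_SPACE = re.compile(r"(\[.+?\])|\s")


def strip_with_capture(text):
    """Put bracketed chunks back unchanged; replace other whitespace with ''."""
    # group(1) is None when the \s branch matched, so fall back to "".
    return BRACKET_OR_SPACE.sub(lambda m: m.group(1) or "", text)


if __name__ == "__main__":
    # Hypothetical sample input.
    print(strip_with_capture("ab c[d e]f g"))  # abc[d e]fg
```

Because the bracketed chunk is consumed as a single match, the whitespace branch never gets a chance to look inside it, which is why no lookaround is needed.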

Note that I tested this with Regex Hero (a .NET regex tester), and not in PHP. So I'm not 100% sure this will work for you.

That was an interesting one. Sounded simple at first, then seemed rather difficult. And then the solution I finally arrived at was indeed simple. I was surprised the solution didn't require a lookaround of any sort. And it should be faster than any method that uses a lookaround.

Upvotes: 2

Draemon

Reputation: 34711

The following will match start-of-line or end-of-bracket (which must come before any space you want to match) followed by anything that isn't start-of-bracket or a space, followed by some space.

/((^|\])[^ \[]*) +/

Replacing every match with $1 (a global replace) removes the first block of spaces from each non-bracketed sequence. You have to repeat the replacement until the string stops changing to remove all spaces.

Example:

abcd efg [hij klm]nop qrst u
abcdefg [hij klm]nopqrst u
abcdefg[hij klm]nopqrstu
done
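The repeat-until-stable loop could look like the sketch below, shown in Python for illustration. The helper name `strip_by_repeating` and the choice to return every intermediate string are assumptions made here so that the passes in the example above can be reproduced.

```python
import re

# The pattern from this answer: start of string or "]", then anything that
# is neither a space nor "[", then one or more spaces.
FIRST_SPACES = re.compile(r"((^|\])[^ \[]*) +")


def strip_by_repeating(text):
    """Return each intermediate string, ending with the fully stripped one."""
    steps = [text]
    while True:
        # Global replace with group 1: drops one run of spaces per sequence.
        stripped = FIRST_SPACES.sub(r"\1", steps[-1])
        if stripped == steps[-1]:
            return steps  # nothing changed, so we are done
        steps.append(stripped)


if __name__ == "__main__":
    for line in strip_by_repeating("abcd efg [hij klm]nop qrst u"):
        print(line)
```

The number of passes grows with the longest chain of space-separated words outside brackets, so this is the least efficient of the regex options in this thread for long inputs.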

Upvotes: 0

Cascabel

Reputation: 496772

This doesn't sound like something you really want regex for. It's very easy to parse directly by reading through. Pseudo-code:

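inside_brackets = false;
result = "";
for ( i = 0; i < length(str); i++ ) {
    if ( str[i] == '[' )
        inside_brackets = true;
    else if ( str[i] == ']' )
        inside_brackets = false;
    // Copy every character except whitespace found outside brackets.
    // Building a new string avoids skipping characters, which would
    // happen if we deleted from str while still incrementing i.
    if ( inside_brackets || ! is_space(str[i]) )
        result += str[i];
}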

Anything involving regex is going to involve a lot of lookbehind stuff, which will be repeated over and over, and it'll be much slower and less comprehensible.

To make this work for nested brackets, simply change inside_brackets to a counter, starting at zero, incrementing on open brackets, and decrementing on close brackets.
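The counter variant might look like this in Python (used here only as a runnable stand-in for the pseudo-code). The name `strip_with_depth` is hypothetical, and ignoring a stray "]" when the counter is already zero is an assumption added here, not something this answer specifies.

```python
def strip_with_depth(text):
    """Remove whitespace outside brackets, tracking nesting with a counter."""
    depth = 0
    kept = []
    for ch in text:
        if ch == "[":
            depth += 1
        elif ch == "]" and depth > 0:
            # Assumption: a stray "]" with no open bracket is ignored
            # rather than driving the counter negative.
            depth -= 1
        # Keep everything inside brackets, and non-whitespace outside them.
        if depth > 0 or not ch.isspace():
            kept.append(ch)
    return "".join(kept)


if __name__ == "__main__":
    # Nested example from elsewhere in this thread: inner spaces survive.
    print(strip_with_depth("a a[b [c c]b]"))  # aa[b [c c]b]
```

This makes a single pass over the input and copies each character at most once.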

Upvotes: 8

derobert

Reputation: 51137

How to do this depends on what should be done with:

a b [ c [ d [ e ] f ] g

That is ambiguous; possible answers are at least:

  • ab[ c [ d [ e ] f ]g
  • ab[ c [ d [ e ]f]g
  • error out; the brackets don't match!

For the first two cases, you can use regexps. For the third case, you'd be much better off with a (small) parser.

For either case one or two, split the string on the first [. Strip spaces from everything before [ (that's obviously outside of the brackets). Next, look for .*\] (case 1) or .*?\] (case 2) and move that over to your output. Repeat until you're out of input.
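As a rough illustration of cases one and two, the split-and-copy loop can be sketched in Python as follows. The function name `strip_unbalanced`, the `greedy` flag, and the handling of a "[" with no closing bracket (copied through untouched) are all assumptions introduced for this sketch.

```python
import re


def strip_unbalanced(text, greedy):
    """Strip spaces outside brackets; greedy=True is case 1, False is case 2."""
    closer = re.compile(r".*\]" if greedy else r".*?\]")
    out = []
    rest = text
    while rest:
        # Split on the first "["; everything before it is outside brackets.
        before, bracket, rest = rest.partition("[")
        out.append(before.replace(" ", ""))
        if not bracket:
            break  # no "[" left, so the tail was all outside brackets
        match = closer.match(rest)
        if match is None:
            # Assumption: an unclosed "[" keeps the remainder as-is.
            out.append("[" + rest)
            break
        # Move the bracketed part over to the output unchanged.
        out.append("[" + match.group(0))
        rest = rest[match.end():]
    return "".join(out)


if __name__ == "__main__":
    sample = "a b [ c [ d [ e ] f ] g"
    print(strip_unbalanced(sample, greedy=True))   # ab[ c [ d [ e ] f ]g
    print(strip_unbalanced(sample, greedy=False))  # ab[ c [ d [ e ]f]g
```

One consequence worth keeping in mind: the greedy form runs to the last "]" in the string, so with several separate bracket pairs it also preserves the spaces between them.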

Upvotes: 1
