Googlebot

Reputation: 15683

Clean up a comma-separated list by regex

I want to clean up a comma-separated tag list by removing empty tags and extra spaces. I came up with:

$str='first , second ,, third, ,fourth   suffix';
echo preg_replace('#[,]{2,}#',',',preg_replace('#\s*,+\s*#',',',preg_replace('#\s+#s',' ',$str)));

which works well so far, but is it possible to do it in one replacement?
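
Unrolled, the nested call makes three passes: collapse whitespace runs, collapse each comma run together with the spaces around it, then merge the commas that the second pass leaves adjacent. A minimal sketch of those passes, written in JavaScript with String.prototype.replace standing in for preg_replace (an assumption for illustration; the function name threePassClean and the step variables are hypothetical):

```javascript
// Hypothetical helper mirroring the three nested preg_replace calls, innermost first.
function threePassClean(str) {
  // Pass 1: collapse every whitespace run to a single space.
  const step1 = str.replace(/\s+/g, ' ');
  // Pass 2: replace each comma run, plus the whitespace around it, with one comma.
  const step2 = step1.replace(/\s*,+\s*/g, ',');
  // Pass 3: merge commas that pass 2 left adjacent (", ," has become ",,").
  const step3 = step2.replace(/,{2,}/g, ',');
  return { step1, step2, step3 };
}

console.log(threePassClean('first , second ,, third, ,fourth   suffix'));
```

The third pass is needed only because pass 2 sees ", ," as two separate comma runs and emits one comma for each.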

Upvotes: 1

Views: 250

Answers (2)

Wiktor Stribiżew

Reputation: 626871

You can use

preg_replace('~\s*(?:(,)\s*)+|(\s)+~', '$1$2', $str)

Merging the two alternatives into one results in

preg_replace('~\s*(?:([,\s])\s*)+~', '$1', $str)

See the regex demo and the PHP demo. Details:

  • \s*(?:(,)\s*)+ - zero or more whitespaces, then one or more repetitions of a comma (captured into Group 1, $1) followed by zero or more whitespaces
  • | - or
  • (\s)+ - one or more whitespaces while capturing the last one into Group 2 ($2).

In the second regex, ([,\s]) captures a single comma or a whitespace character.

The second regex matches:

  • \s* - zero or more whitespaces
  • (?:([,\s])\s*)+ - one or more occurrences of
    • ([,\s]) - Group 1 ($1): a comma or a whitespace
    • \s* - zero or more whitespaces
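
Why Group 1 ends up holding a comma whenever the run contains one: each \s* is greedy and swallows the whitespace next to a comma, so ([,\s]) only lands on a whitespace character when the run has no comma at all, and a repeated group keeps the value of its last iteration. A small sketch to inspect this, using JavaScript's matchAll (the helper name separatorRuns is hypothetical):

```javascript
// Hypothetical helper: list every separator run the merged pattern matches,
// together with the character Group 1 holds once the match is complete.
function separatorRuns(str) {
  const re = /\s*(?:([,\s])\s*)+/g;
  return [...str.matchAll(re)].map((m) => ({ run: m[0], group1: m[1] }));
}

console.log(separatorRuns('first , second ,, third, ,fourth   suffix'));
```

For the sample string this reports four runs: ' , ', ' ,, ' and ', ,' each with a comma in Group 1, and the three-space run with a single space in Group 1.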

See the PHP demo:

<?php
 
$str='first , second ,, third, ,fourth   suffix';
echo preg_replace('~\s*(?:(,)\s*)+|(\s)+~', '$1$2', $str) . PHP_EOL;
echo preg_replace('~\s*(?:([,\s])\s*)+~', '$1', $str);
// => first,second,third,fourth suffix
//    first,second,third,fourth suffix

BONUS

Both patterns use only constructs that most backtracking (NFA) regex flavors support. Here is a JavaScript demo:

const str = 'first , second ,, third, ,fourth   suffix';
console.log(str.replace(/\s*(?:(,)\s*)+|(\s)+/g, '$1$2'));
console.log(str.replace(/\s*(?:([,\s])\s*)+/g, '$1'));

It can even be adjusted for use in POSIX tools like sed:

sed -E 's/[[:space:]]*(([,[:space:]])[[:space:]]*)+/\2/g' file > outputfile

See the online demo.
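
Note that these patterns collapse separators but do not drop one at the very start or end of the string, so ',first, second,' still yields ',first,second,'. If that matters, a second, anchored replacement can strip the edges. A sketch under that assumption (the helper name cleanTagList and the trimming step are additions for illustration, not part of the answer above):

```javascript
// Hypothetical helper: collapse separator runs, then strip leading/trailing separators.
function cleanTagList(str) {
  return str
    // Collapse each run of commas/whitespace to its last captured separator.
    .replace(/\s*(?:([,\s])\s*)+/g, '$1')
    // Assumed extra step: remove any separator left at either edge of the string.
    .replace(/^[,\s]+|[,\s]+$/g, '');
}

console.log(cleanTagList(' ,first , second ,, third, ,fourth   suffix , '));
```

Keeping the edge trimming as a separate replacement leaves the main pattern unchanged and easy to compare with the demos above.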

Upvotes: 1

JvdV

Reputation: 75860

You can use:

\h*([,\h])[,\h]*

See an online demo. Or alternatively:

\h*([,\h])(?1)*

See an online demo. Details:

  • \h* - 0+ (Greedy) horizontal-whitespace chars;
  • ([,\h]) - A 1st capture group to match a comma or horizontal-whitespace;
  • [,\h]* - Option 1: 0+ (Greedy) commas or horizontal-whitespace chars;
  • (?1)* - Option 2: Recurse the 1st subpattern 0+ (Greedy) times.

Replace with the 1st capture group:

$str='first , second ,, third, ,fourth   suffix';
echo preg_replace('~\h*([,\h])[,\h]*~', '$1', $str) . PHP_EOL;
echo preg_replace('~\h*([,\h])(?1)*~', '$1', $str);

Both print:

first,second,third,fourth suffix
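
One practical difference from the \s-based patterns: \h matches horizontal whitespace only, so line breaks are left alone. JavaScript has no \h, so the sketch below uses [ \t] as a rough stand-in (an assumption: PCRE's \h also covers other horizontal spaces such as the no-break space); the function names and the sample string are hypothetical:

```javascript
// \s-based pattern: a newline counts as whitespace and is collapsed with the rest.
function collapseAnyWhitespace(str) {
  return str.replace(/\s*(?:([,\s])\s*)+/g, '$1');
}

// Horizontal-only pattern: [ \t] approximates PCRE's \h, so "\n" never matches.
function collapseHorizontalOnly(str) {
  return str.replace(/[ \t]*([, \t])[, \t]*/g, '$1');
}

const multiLine = 'first , second \n third';
console.log(JSON.stringify(collapseAnyWhitespace(multiLine)));  // "first,second third"
console.log(JSON.stringify(collapseHorizontalOnly(multiLine))); // "first,second \n third"
```

If tags can span several lines of input and the line structure should survive, the horizontal-only variant is the safer choice; otherwise the two behave the same on single-line strings like the one in the question.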

Upvotes: 3
