mozlima

Reputation: 179

Regexp to split a string by commas and spaces, ignoring those inside quotes and parentheses

I need to split a string by commas and spaces, but ignore the ones inside double quotes, single quotes and parentheses:

$str = "Questions, \"Quote\",'single quote','comma,inside' (inside parentheses) space #specialchar";

so that the resulting array contains:

[0]Questions
[1]Quote
[2]single quote
[3]comma,inside
[4]inside parentheses
[5]space
[6]#specialchar

My current regexp is:

$tags = preg_split("/[,\s]*[^\w\s]+[\s]*/", $str,0,PREG_SPLIT_NO_EMPTY);

but this drops the special characters and still splits on the commas inside quotes. The resulting array is:

[0]Questions
[1]Quote
[2]single quote
[3]comma
[4]inside
[5]inside parentheses
[6]space
[7]specialchar

PS: this is not CSV.

Many Thanks

Upvotes: 5

Views: 5628

Answers (2)

Alan Moore

Reputation: 75222

Well, this works for the data you supplied:

$rgx = <<<'EOT'
/
  [,\s]++
  (?=(?:(?:[^"]*+"){2})*+[^"]*+$)
  (?=(?:(?:[^']*+'){2})*+[^']*+$)
  (?=(?:[^()]*+\([^()]*+\))*+[^()]*+$)
/x
EOT;

The lookaheads assert that if there are any double-quotes, single-quotes or parentheses ahead of the current match position there's an even number of them, and the parens are in balanced pairs (no nesting allowed). That's a quick-and-dirty way to ensure that the current match isn't occurring inside a pair of quotes or parens.
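As a rough illustration of how the pattern above might be used, here is a minimal sketch that feeds it to `preg_split` with the sample string from the question. The `trim` step that strips the surrounding quotes and parentheses is an assumption added here, not part of the pattern itself: the split only decides where to cut, so the delimiters are still attached to each piece.

```php
<?php
// The sample string from the question.
$str = "Questions, \"Quote\",'single quote','comma,inside' (inside parentheses) space #specialchar";

// Same pattern as above, kept in a nowdoc so nothing is interpolated.
$rgx = <<<'EOT'
/
  [,\s]++
  (?=(?:(?:[^"]*+"){2})*+[^"]*+$)
  (?=(?:(?:[^']*+'){2})*+[^']*+$)
  (?=(?:[^()]*+\([^()]*+\))*+[^()]*+$)
/x
EOT;

// Split on separators that are not inside quotes or parentheses.
$parts = preg_split($rgx, $str, -1, PREG_SPLIT_NO_EMPTY);

// Assumption: strip the wrapping quotes/parentheses from each piece afterwards.
$tags = array_map(function ($part) {
    return trim($part, "\"'()");
}, $parts);

print_r($tags);
```

Note that `trim` would also remove a quote or parenthesis that legitimately starts or ends an unquoted item, so this is only suitable for input shaped like the sample.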

Of course, it assumes the input is well formed. But on the subject of well-formedness, what about escaped quotes within quotes? What if you have quotes inside parens, or vice versa? Would this input be legal?

"not a \" quote", 'not a ) quote', (not ",' quotes)

If so, you've got a much more difficult job ahead of you.
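For input like that, one option is to drop the single regex and scan the string character by character. The following is only a sketch under stated assumptions: `split_tags` is a hypothetical helper, a backslash inside a quoted or parenthesized item is assumed to escape the next character, and nested parentheses are still not handled.

```php
<?php
// Hypothetical helper (not from the answer above): a small hand-written scanner.
function split_tags(string $input): array
{
    $closers = ['"' => '"', "'" => "'", '(' => ')']; // opener => closer
    $tags    = [];
    $current = '';
    $closer  = null; // closing delimiter we are waiting for, or null outside
    $len     = strlen($input);

    for ($i = 0; $i < $len; $i++) {
        $ch = $input[$i];

        if ($closer !== null) {
            if ($ch === '\\' && $i + 1 < $len) {
                $current .= $input[++$i];      // assumed escape: keep next char literally
            } elseif ($ch === $closer) {
                if ($current !== '') {
                    $tags[] = $current;        // end of quoted/parenthesized item
                }
                $current = '';
                $closer  = null;
            } else {
                $current .= $ch;               // other delimiters are plain text here
            }
        } elseif (isset($closers[$ch]) || strpos(", \t\r\n", $ch) !== false) {
            if ($current !== '') {
                $tags[] = $current;            // flush the pending unquoted item
                $current = '';
            }
            if (isset($closers[$ch])) {
                $closer = $closers[$ch];       // enter a quoted/parenthesized item
            }
        } else {
            $current .= $ch;
        }
    }

    if ($current !== '') {
        $tags[] = $current;                    // trailing item (or unterminated quote)
    }
    return $tags;
}

// The tricky input from above, in a nowdoc so the backslash stays literal.
$input = <<<'EOT'
"not a \" quote", 'not a ) quote', (not ",' quotes)
EOT;

print_r(split_tags($input));
```

With the assumptions above, the tricky line yields three items, and the sample string from the question yields the seven items asked for.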

Upvotes: 2

Inshallah

Reputation: 4814

This will work only for non-nested parentheses:

    $regex = <<<HERE
    /  "  ( (?:[^"\\\\]++|\\\\.)*+ ) \"
     | '  ( (?:[^'\\\\]++|\\\\.)*+ ) \'
     | \( ( [^)]*                  ) \)
     | [\s,]+
    /x
    HERE;

    $tags = preg_split($regex, $str, -1,
                         PREG_SPLIT_NO_EMPTY
                       | PREG_SPLIT_DELIM_CAPTURE);

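The quoted and parenthesized items are matched as delimiters: PREG_SPLIT_DELIM_CAPTURE returns their captured contents, and PREG_SPLIT_NO_EMPTY drops the empty pieces left between them. The ++ and *+ are possessive quantifiers: they consume as much as they can and give nothing back for backtracking. This technique is described in perlre(1) as the most efficient way to do this kind of matching.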
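To make the "give nothing back" behaviour concrete, here is a small sketch. The subjects `aaab` and `"say \"hi\"" rest` are made up for illustration. The quoted-string pattern is the double-quote branch from above, written in a nowdoc, so it needs `\\` where the heredoc version needs `\\\\`.

```php
<?php
// Greedy vs. possessive on the same subject.
$greedy     = preg_match('/a+ab/',  'aaab'); // a+ gives one "a" back, so this matches
$possessive = preg_match('/a++ab/', 'aaab'); // a++ keeps all three, so this fails

// Double-quote branch of the pattern above; nowdoc keeps backslashes literal.
$quoted = <<<'EOT'
/"((?:[^"\\]++|\\.)*+)"/
EOT;

// Illustrative subject: a quoted string containing escaped quotes.
$subject = <<<'EOT'
"say \"hi\"" rest
EOT;

preg_match($quoted, $subject, $m);

var_dump($greedy, $possessive); // int(1), int(0)
echo $m[1], "\n";               // say \"hi\"
```

The escaped quotes do not end the match because `\\.` consumes each backslash together with the character after it.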

Upvotes: 6
