SAVAFA

Reputation: 818

Exploding a string using a regular expression

I have a string like the one below. The letters in the example could be numbers or text, in uppercase, lowercase, or both. If a value is a sentence, it is enclosed in single quotes:

$string="a,b,c,(d,e,f),g,'h, i j.',k";

How can I explode that to get the following result?

Array([0]=>"a",[1]=>"b",[2]=>"c",[3]=>"(d,e,f)",[4]=>"g",[5]=>"'h, i j.'",[6]=>"k")

I think a regular expression would be a fast and clean solution. Any ideas?

EDIT: This is what I have done so far. It is very slow for strings that have a long part between parentheses:

$separator="*"; // whatever which is not used in the string
$Pattern="'[^,]([^']+),([^']+)[^,]'";
while(ereg($Pattern,$String,$Regs)){
    $String=ereg_replace($Pattern,"'\\1$separator\\2'",$String);
}

$Pattern="\(([^(^']+),([^)^']+)\)";
while(ereg($Pattern,$String,$Regs)){
    $String=ereg_replace($Pattern,"(\\1$separator\\2)",$String);
}

return $String;

This replaces all the commas inside quotes and parentheses with $separator. Then I can explode the string on commas and replace $separator with the original comma.

Upvotes: 2

Views: 7330

Answers (1)

Casimir et Hippolyte

Reputation: 89567

You can do the job with preg_match_all:

$string="a,b,c,(d,e,f),g,'h, i j.',k";

preg_match_all("~'[^']+'|\([^)]+\)|[^,]+~", $string, $result);
print_r($result[0]);

Explanation:

The trick is to try the quoted and parenthesized alternatives before the generic "anything but a comma" alternative:

~          pattern delimiter
'          a literal single quote
[^']       any character except a single quote
+          one or more times
'          a literal single quote
|          or
\([^)]+\)  the same with parentheses
|          or
[^,]+      any characters except commas, one or more times
~          pattern delimiter

Note that the quantifiers in [^']+', in [^)]+\) and in [^,]+ are all automatically converted to possessive quantifiers at compile time ("auto-possessification"). The first two because the character that follows the class is excluded from it, and the last because it is at the end of the pattern. In all three cases backtracking could never produce a match, so the engine does not attempt it.
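As a minimal sketch of how the pattern above might be used in practice, it can be wrapped in a small helper. The function name splitOutsideGroups is hypothetical and not part of the original answer. The second call illustrates one consequence of the pattern: every alternative needs at least one character, so empty fields are skipped.

```php
<?php
// Hypothetical wrapper around the pattern above; the name is illustrative only.
function splitOutsideGroups(string $input): array
{
    // Alternation order matters: quoted and parenthesized items are tried
    // before the generic "anything but a comma" alternative.
    preg_match_all("~'[^']+'|\([^)]+\)|[^,]+~", $input, $matches);
    return $matches[0];
}

print_r(splitOutsideGroups("a,b,c,(d,e,f),g,'h, i j.',k"));
// items: a | b | c | (d,e,f) | g | 'h, i j.' | k

print_r(splitOutsideGroups("a,,b"));
// items: a | b   (the empty field between the two commas is dropped)
```

If empty fields have to be preserved, this pattern alone is not enough and a different approach would be needed.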

If you have several delimiters that, like quotes, use the same character to open and close, you can capture the opening delimiter in a group and reuse it with a backreference:

$string="a,b,c,(d,e,f),g,'h, i j.',k,°l,m°,#o,p#,@q,r@,s";

// The u modifier makes PCRE treat the multibyte ° as a single character (assuming UTF-8 source).
preg_match_all('~([\'#@°]).*?\1|\([^)]+\)|[^,]+~u', $string, $result);
print_r($result[0]);

Explanation:

(['#@°])   one character from the class, captured in group 1
.*?        any characters, zero or more times, lazily (as few as possible)
\1         the content of group 1 (the same delimiter again)
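The effect of the backreference can be checked with a small sketch. The input string is made up for illustration and uses only ASCII delimiters; a stray single quote inside a #...# item does not close it, because \1 must repeat the character that opened the item.

```php
<?php
// Illustrative input (not from the original post): the ' inside #...# is not a closer.
$input = "x,#a,'b#,@c,d@,y";

// Group 1 captures the opening delimiter; \1 requires the same one to close.
preg_match_all('~([\'#@]).*?\1|[^,]+~', $input, $result);
print_r($result[0]);
// items: x | #a,'b# | @c,d@ | y
```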

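With nested parentheses, the parenthesized alternative has to call itself recursively; (?-1) is a relative reference to the capture group that encloses it: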

$string="a,b,(c,(d,(e),f),t),g,'h, i j.',k,°l,m°,#o,p#,@q,r@,s";

// u: UTF-8 mode, needed so that the multibyte ° counts as one character.
preg_match_all('~([\'#@°]).*?\1|(\((?:[^()]+|(?-1))*+\))|[^,]+~u', $string, $result);
print_r($result[0]);
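To see the recursion in isolation, here is a reduced sketch that keeps only the parenthesized and generic alternatives. It uses the absolute reference (?1) instead of (?-1); with a single capture group both point at the same group. The input string is an assumption made for the example.

```php
<?php
// Illustrative input with three levels of nesting.
$input = "a,(b,(c,(d),e),f),g";

// Group 1 matches "(", then any mix of non-parenthesis runs and nested
// groups (via the recursive call (?1)), then ")". The possessive *+ stops
// the engine from retrying shorter repetitions once a group has matched.
preg_match_all('~(\((?:[^()]+|(?1))*+\))|[^,]+~', $input, $result);
print_r($result[0]);
// items: a | (b,(c,(d),e),f) | g
```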

Upvotes: 6
