Gregory Bologna

Reputation: 286

Split a string on the nth semicolon

I need help finding a PCRE pattern to use with preg_split().

I'm using the regex pattern below to split a string based on its leading three-character code and semicolons. The pattern works fine in JavaScript, but now I need to use it in PHP. I tried preg_split(), but I just get junk back.

// Each group begins with a three-letter code and has three segments separated by semicolons. The string is not terminated with a semicolon.

// Pseudocode    
string_to_split = "AAA;RED;111;BBB;BLUE;22;CCC;GREEN;33;DDD;WHITE;44"

// This works in JS  
// https://regex101.com  
$pattern = "/[AAA|BBB|CCC|DDD][^;]*;[^;]*[;][^;]*/gi";

Match 1  
Full match  0-11    `AAA;RED;111`  
Match 2  
Full match  12-23   `BBB;BLUE;22`  
Match 3  
Full match  24-36   `CCC;GREEN;33`  
Match 4  
Full match  37-49   `DDD;WHITE;44`  

$pattern = "/[AAA|BBB|CCC|DDD][^;]*;[^;]*[;][^;]*/";  
$split = preg_split($pattern, $string_to_split);

returns

array(5)  
    0:""  
    1:";"  
    2:";"  
    3:";"  
    4:""  

Upvotes: 0

Views: 907

Answers (4)

mickmackusa

Reputation: 47934

Validation is NOT the super-power of preg_split(). Based on your provided input, you shouldn't need to strictly match the start of each segment. If there is an invalid start, your splitting task would result in a segment of incorrect/unexpected length. If you need to validate segments, then preg_match_all() would be the go-to tool.

If there is no validation to do, just split on every 3rd semicolon. My pattern below matches one or more non-semicolons followed by a semicolon, three times in a row. In each of those three consecutive iterations, \K forgets/releases the previously matched characters, which effectively means that only the last semicolon is consumed in the explosion.

$string = "AAA;RED;111;BBB;BLUE;22;CCC;GREEN;33;DDD;WHITE;44";
var_export(
    preg_split('/(?:[^;]+\K;){3}/', $string)
);

Output:

array (
  0 => 'AAA;RED;111',
  1 => 'BBB;BLUE;22',
  2 => 'CCC;GREEN;33',
  3 => 'DDD;WHITE;44',
)
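
The same idea can be generalized to "every nth delimiter". The following is only a sketch: `splitEveryNth()` and its parameters are hypothetical names invented for illustration, and it assumes a single-byte delimiter and segments that are never empty (the `+` quantifier requires at least one character per segment).

```php
<?php
/**
 * Hypothetical helper: split $input on every $n-th occurrence of $delimiter.
 * Assumes a one-byte delimiter and non-empty segments.
 */
function splitEveryNth(string $input, int $n, string $delimiter = ';'): array
{
    if ($n < 1 || strlen($delimiter) !== 1) {
        throw new InvalidArgumentException('Expected $n >= 1 and a one-byte delimiter.');
    }

    // Escape the delimiter so it is safe both inside and outside the character class.
    $d = preg_quote($delimiter, '/');

    // \K discards what was matched so far, so only the final delimiter
    // of each run of $n segments is consumed by the split.
    $pattern = '/(?:[^' . $d . ']+\K' . $d . '){' . $n . '}/';

    return preg_split($pattern, $input);
}

$string = "AAA;RED;111;BBB;BLUE;22;CCC;GREEN;33;DDD;WHITE;44";
var_export(splitEveryNth($string, 3));
```

With `$n = 3` this builds the same pattern as above; a trailing group with fewer than `$n` segments is returned as the last element unchanged.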

Upvotes: 1

fubar

Reputation: 17388

I've modified your pattern a little, and added a couple of flags to preg_split.

The PREG_SPLIT_NO_EMPTY flag will exclude empty matches from the result, and PREG_SPLIT_DELIM_CAPTURE will include the captured value in the result.

$split = preg_split('/([abcd]{3};[^;]+;\d+);?/i', $string, -1, PREG_SPLIT_NO_EMPTY|PREG_SPLIT_DELIM_CAPTURE);

Result:

Array
(
    [0] => AAA;RED;111
    [1] => BBB;BLUE;22
    [2] => CCC;GREEN;33
    [3] => DDD;WHITE;44
)
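
To see what each flag contributes, it may help to run the same pattern with no flags, with only PREG_SPLIT_DELIM_CAPTURE, and with both. This is a small sketch; the variable names are illustrative only.

```php
<?php
$string  = "AAA;RED;111;BBB;BLUE;22;CCC;GREEN;33;DDD;WHITE;44";
$pattern = '/([abcd]{3};[^;]+;\d+);?/i';

// No flags: every record is treated as a delimiter and thrown away,
// leaving only the empty strings between consecutive matches.
$plain = preg_split($pattern, $string);

// DELIM_CAPTURE only: the captured records are kept,
// interleaved with those empty strings.
$withCapture = preg_split($pattern, $string, -1, PREG_SPLIT_DELIM_CAPTURE);

// Both flags: the empty strings are dropped and only the records remain.
$both = preg_split($pattern, $string, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);

var_export([count($plain), count($withCapture), count($both)]);
```

Because the pattern consumes the entire input, the split result consists only of captured delimiters, which is why preg_match_all is arguably the more natural tool here.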

Alternatively, and more suitably, you can use preg_match_all with the following pattern.

preg_match_all('/([abcd]{3};[^;]+;\d+);?/i', $string, $matches);

print_r($matches[1]); // capture group 1; $matches[0] would also contain the optional trailing semicolon

Result:

Array
(
    [0] => AAA;RED;111
    [1] => BBB;BLUE;22
    [2] => CCC;GREEN;33
    [3] => DDD;WHITE;44
)

Upvotes: 1

Pinke Helga

Reputation: 6692

Based on the additional information you gave in comments on the answers, I have updated my answer to be very specific to your source format.

You might want something like this:

$subject = "AAA;RED;111;AAA;Oh my dog;12.34;AAA;Oh Long John;.4556;BBB;Oh Long Johnson;1.2323;BBB;Oh Don Piano;.33;CCC;Why I eyes ya;1.445;CCC;All the live long day;2.3343;DDD;Faith Hilling;.89";

$pattern = '/(?<=;|^)(AAA|BBB|CCC|DDD);([^;]*);((?:\d*\.)?\d+)(?=;|$)/';
preg_match_all($pattern, $subject,$matches);
var_dump($matches);

giving you

array (size=4)
  0 =>
    array (size=8)
      0 => string 'AAA;RED;111' (length=11)
      1 => string 'AAA;Oh my dog;12.34' (length=19)
      2 => string 'AAA;Oh Long John;.4556' (length=22)
      3 => string 'BBB;Oh Long Johnson;1.2323' (length=26)
      4 => string 'BBB;Oh Don Piano;.33' (length=20)
      5 => string 'CCC;Why I eyes ya;1.445' (length=23)
      6 => string 'CCC;All the live long day;2.3343' (length=32)
      7 => string 'DDD;Faith Hilling;.89' (length=21)
  1 =>
    array (size=8)
      0 => string 'AAA' (length=3)
      1 => string 'AAA' (length=3)
      2 => string 'AAA' (length=3)
      3 => string 'BBB' (length=3)
      4 => string 'BBB' (length=3)
      5 => string 'CCC' (length=3)
      6 => string 'CCC' (length=3)
      7 => string 'DDD' (length=3)
  2 =>
    array (size=8)
      0 => string 'RED' (length=3)
      1 => string 'Oh my dog' (length=9)
      2 => string 'Oh Long John' (length=12)
      3 => string 'Oh Long Johnson' (length=15)
      4 => string 'Oh Don Piano' (length=12)
      5 => string 'Why I eyes ya' (length=13)
      6 => string 'All the live long day' (length=21)
      7 => string 'Faith Hilling' (length=13)
  3 =>
    array (size=8)
      0 => string '111' (length=3)
      1 => string '12.34' (length=5)
      2 => string '.4556' (length=5)
      3 => string '1.2323' (length=6)
      4 => string '.33' (length=3)
      5 => string '1.445' (length=5)
      6 => string '2.3343' (length=6)
      7 => string '.89' (length=3)

The start marker should occur at the start of the string or immediately after a semicolon, so we use a lookbehind that looks for the start or a semicolon:

(?<=;|^)

We look for one of the alternatives AAA, BBB, CCC or DDD and capture it:

(AAA|BBB|CCC|DDD)

After a semicolon we look for any characters except a semicolon. The quantifier * means 0 or more times. Use + if you want at least 1.

;([^;]*)

After the next semicolon we look for a number. This task has to be split up to fit a valid format. We first look for 0 or more digits followed by a dot, all of which is optional:

(?:\d*\.)?

where (?:) means a non-capturing group.

After that we look for at least one digit: \d+

We want to capture both parts of the number using parentheses after the searched semicolon:

;((?:\d*\.)?\d+)

This matches "1234", ".1234", "1.234", "12.34", "123.4" but not "1234." or "1.2.3"
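
The number sub-pattern can be checked in isolation against those samples. This is a small sketch: the `^` and `$` anchors are added here only so the sub-pattern has to match the whole sample, and the variable names are illustrative.

```php
<?php
// The number part of the pattern, anchored so it must match the whole sample.
$numberPattern = '/^(?:\d*\.)?\d+$/';

$samples = ['1234', '.1234', '1.234', '12.34', '123.4', '1234.', '1.2.3'];

$results = [];
foreach ($samples as $sample) {
    // preg_match returns 1 on a match and 0 otherwise.
    $results[$sample] = preg_match($numberPattern, $sample) === 1;
}

var_export($results);
```

The first five samples should come back true and the last two false.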

Finally we want this to immediately occur before a semicolon or the end of string. Thus we do a lookahead:

(?=;|$)

Lookaheads and lookbehinds only assert what comes after or before the match; the text they inspect is not part of the matched or captured result.
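
If one row per record is more convenient than one array per capture group, preg_match_all accepts the PREG_SET_ORDER flag. A minimal sketch using the same pattern on a shortened, made-up subject string:

```php
<?php
// Shortened sample input, for illustration only.
$subject = 'AAA;RED;111;BBB;Oh Don Piano;.33';
$pattern = '/(?<=;|^)(AAA|BBB|CCC|DDD);([^;]*);((?:\d*\.)?\d+)(?=;|$)/';

// PREG_SET_ORDER groups the results by match instead of by capture group.
preg_match_all($pattern, $subject, $records, PREG_SET_ORDER);

foreach ($records as [, $code, $label, $number]) {
    // Index 0 (the full match) is skipped by the leading comma.
    printf("%s | %s | %s\n", $code, $label, $number);
}
// prints:
// AAA | RED | 111
// BBB | Oh Don Piano | .33
```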

Upvotes: 1

Toto

Reputation: 91428

You don't want to split your string; you want to match elements. Use preg_match_all:

$str = "AAA;RED;111;AAA;Oh my dog;2.34;AAA;Oh Long John;.4556;BBB;Oh Long Johnson;1.2323;BBB;Oh Don Piano;.33;CCC;Why I eyes ya;1.445;CCC;All the live long day;2.3343;DDD;Faith Hilling;.89";
$res = preg_match_all('/(?:AAA|BBB|CCC|DDD);[^;]*;[^;]*;?/', $str, $m);
print_r($m[0]);

Output:

Array
(
    [0] => AAA;RED;111;
    [1] => AAA;Oh my dog;2.34;
    [2] => AAA;Oh Long John;.4556;
    [3] => BBB;Oh Long Johnson;1.2323;
    [4] => BBB;Oh Don Piano;.33;
    [5] => CCC;Why I eyes ya;1.445;
    [6] => CCC;All the live long day;2.3343;
    [7] => DDD;Faith Hilling;.89
)

Explanation:

/                       : regex delimiter
  (?:AAA|BBB|CCC|DDD)   : non-capturing group: AAA or BBB or CCC or DDD
  ;                     : a semicolon
  [^;]*                 : 0 or more characters that are not a semicolon
  ;                     : a semicolon
  [^;]*                 : 0 or more characters that are not a semicolon
  ;?                    : optional semicolon
/                       : regex delimiter
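
The group syntax matters here: the pattern in the question used square brackets, which form a character class rather than an alternation. A short sketch of the difference, with `^` and `$` anchors added only to make the comparison strict and with illustrative variable names:

```php
<?php
// Square brackets: ONE character out of the set A, B, C, D and |.
$class = '/^[AAA|BBB|CCC|DDD]$/';
// Non-capturing group: one of the four whole codes.
$group = '/^(?:AAA|BBB|CCC|DDD)$/';

$checks = [
    'class matches "A"'   => preg_match($class, 'A') === 1,   // true
    'class matches "|"'   => preg_match($class, '|') === 1,   // true
    'class matches "AAA"' => preg_match($class, 'AAA') === 1, // false
    'group matches "AAA"' => preg_match($group, 'AAA') === 1, // true
    'group matches "A"'   => preg_match($group, 'A') === 1,   // false
];

var_export($checks);
```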

Upvotes: 0
