Etienne Lehoux-Jobin

Reputation: 23

preg_split : splitting a string according to a very specific pattern

Regex/PHP n00b here. I'm trying to use the PHP "preg_split" function...

I have strings that follow a very specific pattern according to which I want to split them.

Example of a string:

CADAVRES [FILM] (Canada : Québec, Érik Canuel, 2009, long métrage) FICTION

Desired result:

[0]CADAVRES
[1]FILM
[2]Canada : Québec
[3]Érik Canuel
[4]2009
[5]long métrage
[6]FICTION

Delimiters (in order of occurrence):

" ["
"] ("
", "
", "
", "
") "

How do I go about writing the regex correctly?

Here's what I've tried:

<?php
$pattern = "/\s\[/\]\s\(/,\s/,\s/,\s/\)\s/";
$string = "CADAVRES [FILM] (Canada : Québec, Érik Canuel, 2009, long métrage) FICTION";
$keywords = preg_split($pattern, $string);
print_r($keywords);

It's not working, and I don't understand what I'm doing wrong. Then again, I've just begun trying to deal with regex and PHP, so yeah... There are so many escape characters, I can't see right...

Thank you very much!

Upvotes: 2

Views: 983

Answers (3)

ggorlen

Reputation: 56965

Here's an attempt with preg_match:

$pattern = "/^([^\[]+)\[([^\]]+)\]\s+\(([^,]+),\s+([^,]+),\s+([^,]+),\s+([^,]+)\)\s+(.+)$/i";
$string = "CADAVRES [FILM] (Canada : Québec, Érik Canuel, 2009, long métrage) FICTION";
preg_match($pattern, $string, $keywords);
array_shift($keywords);
print_r($keywords);

Output:

Array
(
    [0] => CADAVRES 
    [1] => FILM
    [2] => Canada : Québec
    [3] => Érik Canuel
    [4] => 2009
    [5] => long métrage
    [6] => FICTION
)


Regex breakdown:

^   anchor to start of string
 (    begin capture group 1
  [^\[]+   one or more non-left bracket characters
        )   end capture group 1
         \[   literal left bracket
           (   begin capture group 2
            [^\]]+   one or more non-right bracket characters
                  )    end capture group 2
                   \]   literal right bracket
                     \s+    one or more whitespace characters
                        \(    literal open parenthesis
                          (     open capture group 3
                           [^,]+   one or more non-comma characters
                                )     end capture group 3
                                 ,\s+     literal comma followed by one or more spaces
                                     ([^,]+),\s+([^,]+),\s+([^,]+)   repeats of the above
                                                                  \)   literal closing parenthesis
                                                                    \s+   one or more whitespace characters
                                                                       (  begin capture group 7
                                                                        .+  everything else
                                                                           )  end capture group 7
                                                                            $   anchor to end of string
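
The same breakdown can also live inside the pattern itself. The following is only a sketch, assuming PCRE's extended mode (the x modifier, which ignores unescaped whitespace in the pattern and allows # comments) and named groups; the group names (title, type, country, director, year, format, genre) are made up for illustration and are not part of the original pattern. It also trims each field, which removes the trailing space seen in "CADAVRES ".

```php
<?php
// Sketch only: the same regex as above, in extended mode with hypothetical group names.
$pattern = <<<'REGEX'
/^
  (?<title>    [^\[]+ ) \[       # everything up to the first left bracket
  (?<type>     [^\]]+ ) \] \s+   # bracketed field, then whitespace
  \(
  (?<country>  [^,]+ ) , \s+     # four comma-separated fields
  (?<director> [^,]+ ) , \s+
  (?<year>     [^,]+ ) , \s+
  (?<format>   [^,]+ )
  \) \s+
  (?<genre>    .+ )              # everything else
$/xu
REGEX;

$string = "CADAVRES [FILM] (Canada : Québec, Érik Canuel, 2009, long métrage) FICTION";
$fields = [];
if (preg_match($pattern, $string, $m)) {
    // preg_match returns each named group twice (by name and by number);
    // keep the string keys only, then trim stray whitespace.
    $named = array_filter($m, 'is_string', ARRAY_FILTER_USE_KEY);
    $fields = array_map('trim', $named);
}
print_r($fields);
```

With named keys, the caller can write $fields['year'] instead of remembering that the year is index 4.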

This assumes your structure is fixed, and the pattern is not particularly pretty, but it should be robust to delimiters creeping into fields where they're not supposed to be. A title containing a : or , seems plausible, and it would break a "split on these delimiters anywhere"-type solution. For example,

"Matrix:, Trilogy()   [FILM, reviewed: good]    (Canada() :   Québec  ,  \t Érik Canuel , ): 2009 ,   long ():():[][]métrage) FICTIO  , [(:N";

correctly parses as:

Array
(
    [0] => Matrix:, Trilogy()   
    [1] => FILM, reviewed: good
    [2] => Canada() :   Québec  
    [3] => Érik Canuel 
    [4] => ): 2009 
    [5] => long ():():[][]métrage
    [6] => FICTIO  , [(:N
)


Additionally, if your parenthesized comma region is variable length, you might want to extract that first and parse it, then handle the rest of the string.
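
A rough sketch of that two-step idea, under the assumption that the parenthesized region ends at the last ") " in the string and that its fields never contain commas themselves. The variable names ($outer, $inner, $result) are only illustrative.

```php
<?php
$string = "CADAVRES [FILM] (Canada : Québec, Érik Canuel, 2009, long métrage) FICTION";

// Step 1: capture the title, the bracketed type, the whole parenthesized
// region (greedy, so it runs to the last ")" followed by whitespace), and the tail.
$outer = '/^([^\[]+)\[([^\]]+)\]\s+\((.*)\)\s+(.+)$/u';

$result = [];
if (preg_match($outer, $string, $m)) {
    // Step 2: split the region on commas, however many fields it holds.
    $inner = preg_split('/\s*,\s*/u', trim($m[3]));
    $result = array_merge([trim($m[1]), trim($m[2])], $inner, [trim($m[4])]);
}
print_r($result);
```

For the sample string this yields the same seven fields; a record with three or five comma-separated fields inside the parentheses would simply produce a shorter or longer array.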

Upvotes: 1

Nick

Reputation: 147166

You can use this regex to split on:

([^\w:]\s[^\w:]?|\s[^\w:])

It matches either a non-word, non-colon character followed by a whitespace character and an optional second non-word, non-colon character, or a whitespace character followed by a non-word, non-colon character. That covers all of your desired delimiters. In PHP (note that you need the u modifier so that the subject is treated as UTF-8 and \w matches accented letters such as é):

$input = "CADAVRES [FILM] (Canada : Québec, Érik Canuel, 2009, long métrage) FICTION";
$keywords = preg_split('/([^\w:]\s[^\w:]?|\s[^\w:])/u', $input);
print_r($keywords);

Output:

Array
(
    [0] => CADAVRES
    [1] => FILM
    [2] => Canada : Québec
    [3] => Érik Canuel
    [4] => 2009
    [5] => long métrage
    [6] => FICTION
)
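
If the generic character classes feel too loose, the literal delimiters from the question can also be joined with alternation (|). The attempt in the question fails because / is the pattern delimiter: PHP treats the first unescaped / after the opening one as the end of the pattern and tries to read whatever follows as modifiers. A minimal sketch, assuming the delimiters are exactly " [", "] (", ", " and ") ":

```php
<?php
$input = "CADAVRES [FILM] (Canada : Québec, Érik Canuel, 2009, long métrage) FICTION";

// One alternative per literal delimiter: " [", "] (", ", ", ") ".
// Brackets and parentheses are escaped because they are regex metacharacters.
$parts = preg_split('/ \[|\] \(|, |\) /', $input);
print_r($parts);
```

Like any "split on these delimiters anywhere" approach, this also splits on a ", " that appears inside a title or another field.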


Upvotes: 3

Tim Biegeleisen

Reputation: 521339

I managed to work out a solution using preg_match_all:

$input = "CADAVRES [FILM] (Canada : Québec, Érik Canuel, 2009, long métrage) FICTION";
preg_match_all("|[^-\\[\\](),/\\s]+(?:(?: :)? [^-\\[\\](),/]+)?|", $input, $matches);
print_r($matches[0]);

Array
(
    [0] => CADAVRES
    [1] => FILM
    [2] => Canada : Québec
    [3] => Érik Canuel
    [4] => 2009
    [5] => long métrage
    [6] => FICTION
)

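The regex above treats a term as a run of characters that are not brackets, parentheses, commas, hyphens, slashes, or whitespace. A term may then continue after a single space, optionally preceded by " :", which allows multi-word terms such as "long métrage" and colon-separated ones such as "Canada : Québec".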
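
Because this approach collects terms instead of matching the whole record, a stray delimiter inside a field silently changes the number of terms. One possible guard, sketched here with a hypothetical helper name (parseRecord) and an assumed fixed count of seven fields, is to reject any result with an unexpected count (nullable return types need PHP 7.1 or later):

```php
<?php
// parseRecord() is a hypothetical helper, not part of the original answer.
// It returns the matched terms, or null when the count is not what we expect.
function parseRecord(string $input, int $expected = 7): ?array
{
    // Same pattern as above, written in single quotes so each backslash appears once.
    $pattern = '|[^-\[\](),/\s]+(?:(?: :)? [^-\[\](),/]+)?|';
    preg_match_all($pattern, $input, $matches);
    return count($matches[0]) === $expected ? $matches[0] : null;
}

$good = parseRecord("CADAVRES [FILM] (Canada : Québec, Érik Canuel, 2009, long métrage) FICTION");
// A made-up record whose title contains a comma, which yields eight terms.
$bad = parseRecord("Matrix, Reloaded [FILM] (Canada, Someone, 2009, long métrage) FICTION");

var_dump($good !== null); // seven terms found
var_dump($bad === null);  // rejected: the comma in the title adds a term
```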

Upvotes: 3
