Reputation: 23
Regex/PHP n00b here. I'm trying to use the PHP "preg_split" function...
I have strings that follow a very specific pattern according to which I want to split them.
Example of a string:
CADAVRES [FILM] (Canada : Québec, Érik Canuel, 2009, long métrage) FICTION
Desired result:
[0]CADAVRES [1]FILM [2]Canada : Québec [3]Érik Canuel [4]2009 [5]long métrage [6]FICTION
Delimiters (in order of occurrence):
" [" "] (" ", " ", " ", " ") "
How do I go about writing the regex correctly?
Here's what I've tried:
<?php
$pattern = "/\s\[/\]\s\(/,\s/,\s/,\s/\)\s/";
$string = "CADAVRES [FILM] (Canada : Québec, Érik Canuel, 2009, long métrage) FICTION";
$keywords = preg_split($pattern, $string);
print_r($keywords);
It's not working, and I don't understand what I'm doing wrong. Then again, I've just begun trying to deal with regex and PHP, so yeah... There are so many escape characters, I can't see right...
Thank you very much!
Upvotes: 2
Views: 983
Reputation: 56965
Here's an attempt with preg_match:
$pattern = "/^([^\[]+)\[([^\]]+)\]\s+\(([^,]+),\s+([^,]+),\s+([^,]+),\s+([^,]+)\)\s+(.+)$/i";
$string = "CADAVRES [FILM] (Canada : Québec, Érik Canuel, 2009, long métrage) FICTION";
preg_match($pattern, $string, $keywords);
array_shift($keywords);
print_r($keywords);
Output:
Array
(
[0] => CADAVRES
[1] => FILM
[2] => Canada : Québec
[3] => Érik Canuel
[4] => 2009
[5] => long métrage
[6] => FICTION
)
Regex breakdown:
^ anchor to start of string
( begin capture group 1
[^\[]+ one or more non-left bracket characters
) end capture group 1
\[ literal left bracket
( begin capture group 2
[^\]]+ one or more non-right bracket characters
) end capture group 2
\] literal right bracket
\s+ one or more whitespace characters
\( literal open parenthesis
( open capture group 3
[^,]+ one or more non-comma characters
) end capture group 3
,\s+ literal comma followed by one or more whitespace characters
([^,]+),\s+([^,]+),\s+([^,]+) repeats of the above
\) literal closing parenthesis
\s+ one or more whitespace characters
( begin capture group 7
.+ everything else
) end capture group 7
$ anchor to end of string
This assumes your structure is static and is not particularly pretty, but on the other hand it should be robust to delimiters creeping into fields where they're not supposed to be. For example, a title containing a : or a , seems plausible and would break a "split on these delimiters anywhere"-type solution. To illustrate,
"Matrix:, Trilogy() [FILM, reviewed: good] (Canada() : Québec , \t Érik Canuel , ): 2009 , long ():():[][]métrage) FICTIO , [(:N";
correctly parses as (some fields keep a trailing space, which print_r does not make visible):
Array
(
[0] => Matrix:, Trilogy()
[1] => FILM, reviewed: good
[2] => Canada() : Québec
[3] => Érik Canuel
[4] => ): 2009
[5] => long ():():[][]métrage
[6] => FICTIO , [(:N
)
Additionally, if your parenthesized comma region is variable length, you might want to extract that first and parse it, then handle the rest of the string.
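That last suggestion could be sketched roughly as follows. This is only an illustration under stated assumptions: parseRecord is a hypothetical helper name, the parenthesized region is taken to end at the last ") " in the line, and fields inside it are assumed not to contain commas themselves.

```php
<?php
// Hypothetical helper: parse "TITLE [TYPE] (field, field, ...) GENRE"
// where the number of comma-separated fields inside the parentheses may vary.
function parseRecord(string $line): ?array
{
    // Step 1: capture the title, the bracketed type, the whole parenthesized
    // region, and whatever follows it. The greedy (.*) runs to the LAST ")"
    // that is followed by whitespace and more text.
    $outer = '/^([^\[]+)\[([^\]]+)\]\s+\((.*)\)\s+(.+)$/su';
    if (preg_match($outer, $line, $m) !== 1) {
        return null; // the line does not have the expected structure
    }

    // Step 2: split the parenthesized region on commas, however many there are.
    $fields = preg_split('/\s*,\s*/u', trim($m[3]));

    // Step 3: stitch everything back together, trimming stray whitespace.
    return array_merge([trim($m[1]), trim($m[2])], $fields, [trim($m[4])]);
}

print_r(parseRecord("CADAVRES [FILM] (Canada : Québec, Érik Canuel, 2009, long métrage) FICTION"));
print_r(parseRecord("CADAVRES [FILM] (Canada : Québec, 2009) FICTION"));
```

The trade-off is that the inner split is a plain "split on commas", so it gives up the tolerance for stray delimiters that the fixed-length pattern has.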
Upvotes: 1
Reputation: 147166
You can use this regex to split on:
([^\w:]\s[^\w:]?|\s[^\w:])
It looks for a character that is neither a word character nor a colon, followed by a whitespace character, followed by an optional second character of that same kind; or a whitespace character followed by one such character. This will match all your desired split patterns. In PHP (note that you need the u modifier so that accented letters such as É are treated as word characters):
$input = "CADAVRES [FILM] (Canada : Québec, Érik Canuel, 2009, long métrage) FICTION";
$keywords = preg_split('/([^\w:]\s[^\w:]?|\s[^\w:])/u', $input);
print_r($keywords);
Output:
Array
(
[0] => CADAVRES
[1] => FILM
[2] => Canada : Québec
[3] => Érik Canuel
[4] => 2009
[5] => long métrage
[6] => FICTION
)
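For comparison, here is a minimal sketch of a stricter variant that splits only on the exact delimiter sequences listed in the question. It is an illustration, not a drop-in replacement, and it assumes the delimiters never occur inside a field. The key point is that the alternatives are joined with | rather than /. In the pattern from the question, the first unescaped / after the opening delimiter ends the pattern, and everything after it is read as modifiers.

```php
<?php
// Split on the literal delimiters " [", "] (", ", " and ") ".
// The outer "/" characters are only the pattern delimiters; "|" means "or".
$input = "CADAVRES [FILM] (Canada : Québec, Érik Canuel, 2009, long métrage) FICTION";
$pattern = '/ \[|\] \(|, |\) /';
$keywords = preg_split($pattern, $input);
print_r($keywords);
```

Using a single-quoted string keeps the backslashes readable, because PHP passes them to the regex engine unchanged.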
Upvotes: 3
Reputation: 521339
I managed to work out a solution using preg_match_all:
$input = "CADAVRES [FILM] (Canada : Québec, Érik Canuel, 2009, long métrage) FICTION";
preg_match_all("|[^-\\[\\](),/\\s]+(?:(?: :)? [^-\\[\\](),/]+)?|", $input, $matches);
print_r($matches[0]);
Array
(
[0] => CADAVRES
[1] => FILM
[2] => Canada : Québec
[3] => Érik Canuel
[4] => 2009
[5] => long métrage
[6] => FICTION
)
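The above regex treats a term as a run of characters that are not hyphens, square brackets, parentheses, commas, slashes, or whitespace. It also allows a term to continue after a single space, optionally preceded by " :", so multi-word terms and terms with a colon separator in the middle are kept together.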
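One thing to watch for, shown here with a made-up three-word title (the input string is hypothetical): because the second character class allows spaces, a multi-word term can pick up a trailing space before the next bracket. A small sketch that trims each match:

```php
<?php
// Hypothetical input with a three-word title.
$input = "UN TITRE LONG [FILM] (Canada : Québec, Érik Canuel, 2009, long métrage) FICTION";

// Same pattern as in the answer above.
preg_match_all("|[^-\\[\\](),/\\s]+(?:(?: :)? [^-\\[\\](),/]+)?|", $input, $matches);

// The first raw match is "UN TITRE LONG " (note the trailing space),
// so trim every term before using it.
$terms = array_map('trim', $matches[0]);
print_r($terms);
```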
Upvotes: 3