Reputation: 888
Consider the following code snippet:
$example = "DELIM1test1DELIM2test2DELIM1test3DELIM2test4"; // and so on
preg_match_all('/DELIM1(.*?)DELIM2(.*?)/', $example, $matches);
The $matches array becomes:
array:3 [
  0 => array:2 [
    0 => "DELIM1test1DELIM2"
    1 => "DELIM1test3DELIM2"
  ]
  1 => array:2 [
    0 => "test1"
    1 => "test3"
  ]
  2 => array:2 [
    0 => ""
    1 => ""
  ]
]
As you can see, it fails to capture test2 and test4. Why does that happen, and what would be a possible solution? Thank you.
Upvotes: 0
Views: 256
Reputation:
preg_split would be better:
$example = "DELIM1test1DELIM2test2DELIM1test3DELIM2test4"; // and so on
$keywords = preg_split("/DELIM1|DELIM2/", $example, 0, PREG_SPLIT_NO_EMPTY);
print_r($keywords);
Output:
Array
(
    [0] => test1
    [1] => test2
    [2] => test3
    [3] => test4
)
demo: http://ideone.com/s5nC0k
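If the first/second pairing from the original two-group pattern still matters, the flat list can be regrouped afterwards. This is only a sketch: the use of array_chunk and the $pairs name are additions for illustration, not part of the answer above, and it assumes no value is empty, since PREG_SPLIT_NO_EMPTY would drop an empty value and shift the pairing.

```php
<?php
$example = "DELIM1test1DELIM2test2DELIM1test3DELIM2test4";

// Split on either delimiter; the empty piece before the leading DELIM1 is dropped.
$keywords = preg_split('/DELIM1|DELIM2/', $example, -1, PREG_SPLIT_NO_EMPTY);

// Regroup the flat list into [after DELIM1, after DELIM2] pairs.
// Assumption: every value is non-empty, otherwise the pairs would shift.
$pairs = array_chunk($keywords, 2);

print_r($pairs); // [[test1, test2], [test3, test4]]
```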
Upvotes: 3
Reputation: 360912
The values after DELIM2 are not enclosed by anything in your pattern, so they won't get matched. For example (with some extra spaces added for readability):
str: DELIM1 test1 DELIM2 test2 DELIM1 test3 DELIM2 test4
pat: DELIM1 (.*?) DELIM2 (.*?) DELIM1 (.*?) DELIM2 (.*?)
     |---- match #1 ---|       |---- match #2 ---|
(.*?) is a non-greedy match, and it can (and here will) match a zero-length string. Nothing follows the second group in the pattern, so the empty string at the boundary between DELIM2 and test2 already satisfies it, and the match ends there.
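A small sketch of that behaviour, using the question's string. The variable names $unconstrained and $anchored are made up for this illustration. The second pattern adds a $ anchor only to show that a constraint after the group forces it to grow; it grows too far to be a real fix.

```php
<?php
$example = "DELIM1test1DELIM2test2DELIM1test3DELIM2test4";

// Trailing lazy group with nothing after it: the empty string is enough.
preg_match('/DELIM1(.*?)DELIM2(.*?)/', $example, $unconstrained);

// Same pattern, but the trailing group must now reach the end of the string.
preg_match('/DELIM1(.*?)DELIM2(.*?)$/', $example, $anchored);

var_dump($unconstrained[2]); // string(0) ""
var_dump($anchored[2]);      // string(27) "test2DELIM1test3DELIM2test4"
```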
Upvotes: 2
Reputation: 786329
You can use this negative lookahead regex:
preg_match_all('/DELIM1((?:(?!DELIM1|DELIM2).)*)DELIM2((?:(?!DELIM1|DELIM2).)*)/',
$example, $matches);
(?:(?!DELIM1|DELIM2).)* matches zero or more characters, checking before each one that neither DELIM1 nor DELIM2 starts at that position.
Output:
print_r($matches);
Array
(
    [0] => Array
        (
            [0] => DELIM1test1DELIM2test2
            [1] => DELIM1test3DELIM2test4
        )
    [1] => Array
        (
            [0] => test1
            [1] => test3
        )
    [2] => Array
        (
            [0] => test2
            [1] => test4
        )
)
Upvotes: 1
Reputation: 627600
Lazy subpatterns at the end of a pattern match either 0 (*?) or 1 (+?) characters, because they match as few as possible.
You can still use lazy matching if you append a lookahead that requires either DELIM1 or the end of the string to follow the value:
/DELIM1(.*?)DELIM2(.*?)(?=$|DELIM1)/
Its performance is very close to that of a tempered greedy token: DELIM1(.*?)DELIM2((?:(?!DELIM1).)*)
However, the best approach is to unroll it:
DELIM1(.*?)DELIM2([^D]*(?:D(?!ELIM1)[^D]*)*)
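A hedged sketch of the unrolled pattern in use. The second input string is invented for this illustration to show that a literal D inside a value is still consumed, because only a D followed by ELIM1 stops the second group.

```php
<?php
// Unrolled form: runs of non-D characters, plus any D not followed by "ELIM1".
$pattern = '/DELIM1(.*?)DELIM2([^D]*(?:D(?!ELIM1)[^D]*)*)/';

$example = "DELIM1test1DELIM2test2DELIM1test3DELIM2test4";
preg_match_all($pattern, $example, $matches);
print_r($matches[1]); // test1, test3
print_r($matches[2]); // test2, test4

// Hypothetical input with a "D" inside a value.
$withD = "DELIM1aDELIM2bDcDELIM1xDELIM2y";
preg_match_all($pattern, $withD, $matchesWithD);
print_r($matchesWithD[2]); // bDc, y
```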
Upvotes: 3
Reputation: 198556
.*? is non-greedy; if there is no constraint after it, it matches the minimum necessary, which is zero characters. You need a constraint after it to force it to match more than the empty string. For example:
/DELIM1(.*?)DELIM2(.*?)(?=DELIM1|$)/
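As a rough sketch, here is that pattern applied to the question's string. PREG_SET_ORDER and the $pairs array are choices made for this example so that each match yields one pair. It assumes the values contain no newlines, since . does not match a newline without the s modifier.

```php
<?php
$example = "DELIM1test1DELIM2test2DELIM1test3DELIM2test4";

// The lookahead makes the second lazy group run up to the next DELIM1 or the end.
preg_match_all(
    '/DELIM1(.*?)DELIM2(.*?)(?=DELIM1|$)/',
    $example,
    $matches,
    PREG_SET_ORDER
);

// With PREG_SET_ORDER each element of $matches is one full match with its groups.
$pairs = [];
foreach ($matches as $m) {
    $pairs[] = [$m[1], $m[2]];
}

print_r($pairs); // [[test1, test2], [test3, test4]]
```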
Upvotes: 3