user1365914

Reputation: 888

PHP preg_match_all does not match everything

Consider the following code snippet:

$example = "DELIM1test1DELIM2test2DELIM1test3DELIM2test4"; // and so on

preg_match_all('/DELIM1(.*?)DELIM2(.*?)/', $example, $matches);

The $matches array becomes:

array:3 [
  0 => array:2 [
    0 => "DELIM1test1DELIM2"
    1 => "DELIM1test3DELIM2"
  ]
  1 => array:2 [
    0 => "test1"
    1 => "test3"
  ]
  2 => array:2 [
    0 => ""
    1 => ""
  ]
]

As you can see, it fails to capture test2 and test4. Why does that happen, and what would be a possible solution? Thank you.

Upvotes: 0

Views: 256

Answers (5)

user557846

Reputation:

preg_split would be better:

$example = "DELIM1test1DELIM2test2DELIM1test3DELIM2test4"; // and so on
$keywords = preg_split("/DELIM1|DELIM2/", $example, 0, PREG_SPLIT_NO_EMPTY);
print_r($keywords);

output:

Array
(
    [0] => test1
    [1] => test2
    [2] => test3
    [3] => test4
)

demo: http://ideone.com/s5nC0k
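
If the values are still needed as (after DELIM1, after DELIM2) pairs, as in the question, the flat list can be regrouped. This is only a sketch: it assumes the delimiters strictly alternate and that no value is empty, because PREG_SPLIT_NO_EMPTY drops empty pieces and would shift the pairing. The variable name $pairs is made up for illustration.

```php
<?php
$example = "DELIM1test1DELIM2test2DELIM1test3DELIM2test4";

// Split on either delimiter and discard the empty piece before the first DELIM1.
$keywords = preg_split('/DELIM1|DELIM2/', $example, -1, PREG_SPLIT_NO_EMPTY);

// Regroup the flat list into pairs: [value after DELIM1, value after DELIM2].
// Assumption: every value is non-empty, otherwise the pairs go out of step.
$pairs = array_chunk($keywords, 2);

print_r($pairs);
```

Each element of $pairs then holds the two values belonging to one DELIM1...DELIM2 occurrence.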

Upvotes: 3

Marc B

Reputation: 360912

Those values are OUTSIDE of your delimiters, so nothing forces the pattern to consume them and they won't get matched. For example (with some extra spaces added):

str:  DELIM1  test1  DELIM2         test2   DELIM1  test3  DELIM2        test4
pat:  DELIM1  (.*?)  DELIM2  (.*?)          DELIM1  (.*?)  DELIM2 (.*?) 
             match #1                                match #2

(.*?) is a non-greedy (lazy) match, and it can and will match a zero-length string when nothing after it requires more. Since nothing follows the second (.*?) in the pattern, the empty string between DELIM2 and test2 already satisfies it, and the match ends right there.
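
A minimal sketch of that behaviour, using a shortened subject string chosen only for illustration: the same trailing group is tried once lazy and once greedy.

```php
<?php
// A lazy group with nothing after it is satisfied by the empty string.
preg_match('/DELIM2(.*?)/', 'DELIM2test2', $lazy);

// The same group made greedy consumes the rest of the subject.
preg_match('/DELIM2(.*)/', 'DELIM2test2', $greedy);

var_dump($lazy[1]);   // string(0) ""
var_dump($greedy[1]); // string(5) "test2"
```

A plain greedy group is not a fix for the original pattern, though: on the full string it would swallow everything up to the end, including the later delimiters.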

Upvotes: 2

anubhava

Reputation: 786329

You can use this negative lookahead regex:

preg_match_all('/DELIM1((?:(?!DELIM1|DELIM2).)*)DELIM2((?:(?!DELIM1|DELIM2).)*)/',
                $example, $matches);

(?:(?!DELIM1|DELIM2).)* matches zero or more characters, one at a time, checking before each one that neither DELIM1 nor DELIM2 starts at that position. The group therefore stops right before the next delimiter.

Output:

print_r($matches);

    Array
    (
        [0] => Array
            (
                [0] => DELIM1test1DELIM2test2
                [1] => DELIM1test3DELIM2test4
            )

        [1] => Array
            (
                [0] => test1
                [1] => test3
            )

        [2] => Array
            (
                [0] => test2
                [1] => test4
            )        
    )

Upvotes: 1

Wiktor Stribiżew

Reputation: 627600

Lazy subpatterns at the end of the pattern match either 0 (*?) or 1 (+?) characters because they match as few characters as possible.

You can still use lazy matching and append a lookahead that will require a DELIM1 to appear after the value or the end of string:

/DELIM1(.*?)DELIM2(.*?)(?=$|DELIM1)/

See demo. Its performance is very close to that of a tempered greedy token (DELIM1(.*?)DELIM2((?:(?!DELIM1).)*) - demo).

However, the best approach is to unroll it:

DELIM1(.*?)DELIM2([^D]*(?:D(?!ELIM1)[^D]*)*)

See another demo
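
To check that the three variants above agree on the question's input, they can be run side by side. This is a sketch only: the array keys (lookahead, tempered, unrolled) are labels made up for illustration, and it compares results, not speed.

```php
<?php
$example = "DELIM1test1DELIM2test2DELIM1test3DELIM2test4";

// The three patterns from this answer, wrapped in / delimiters.
$patterns = [
    'lookahead' => '/DELIM1(.*?)DELIM2(.*?)(?=$|DELIM1)/',
    'tempered'  => '/DELIM1(.*?)DELIM2((?:(?!DELIM1).)*)/',
    'unrolled'  => '/DELIM1(.*?)DELIM2([^D]*(?:D(?!ELIM1)[^D]*)*)/',
];

$results = [];
foreach ($patterns as $name => $pattern) {
    preg_match_all($pattern, $example, $m);
    // Keep both capture groups: values after DELIM1 and values after DELIM2.
    $results[$name] = [$m[1], $m[2]];
}

print_r($results);
```

On this input all three should give test1/test3 for the first group and test2/test4 for the second. Note that . does not match a newline by default, so the first two patterns would need the s modifier for multi-line values.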

Upvotes: 3

Amadan

Reputation: 198556

.*? is non-greedy; if you have no constraint after it, it will match the minimum necessary: zero characters. You need a constraint after it to force it to match more than trivially. For example:

/DELIM1(.*?)DELIM2(.*?)(?=DELIM1|$)/
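
Applied to the string from the question, that pattern could be used roughly like this. The PREG_SET_ORDER flag and the $pairs variable are choices made here for readability, not part of the original suggestion.

```php
<?php
$example = "DELIM1test1DELIM2test2DELIM1test3DELIM2test4";

// The lookahead (?=DELIM1|$) forces the second lazy group to keep
// expanding until the next DELIM1 or the end of the string.
preg_match_all(
    '/DELIM1(.*?)DELIM2(.*?)(?=DELIM1|$)/',
    $example,
    $matches,
    PREG_SET_ORDER // one array per match instead of one array per group
);

$pairs = [];
foreach ($matches as $m) {
    $pairs[] = [$m[1], $m[2]];
}

print_r($pairs);
```

Because the lookahead does not consume DELIM1, the next match can start exactly at that delimiter.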

Upvotes: 3
