preg_match_all doesn't find all matches

Question

I'm trying to find strings that contain a domain. I have the following pattern:

"|s:\d+:\\"((?:.(?!s:\d+))+?){$domain}(.+?)\\";|"

The pattern seems to work, but in PHP I get only the first two matches.

$filename = "caciki_tr.sql";
$domain   = "caciki.com.tr";

$domain   = escape($domain, ".");

$content = file_get_contents($filename);

$pattern = "|s:\d+:\\"((?:.(?!s:\d+))+?){$domain}(.+?)\\";|";

preg_match_all($pattern, $content, $matches, PREG_SET_ORDER);
print_r($matches);

function escape($string, $chars) {
    $chars = str_split($chars);
    foreach ($chars as $char) {
        $string = str_replace($char, "\{$char}", $string);
    }
    return $string;
}

Array
(
    [0] => Array
        (
            [0] => s:121:"/home/caciki/domains/caciki.com.tr/public_html/wp-content/themes/rafine/woocommerce/single-product/product-thumbnails.php";
            [1] => /home/caciki/domains/
            [2] => /public_html/wp-content/themes/rafine/woocommerce/single-product/product-thumbnails.php
        )

    [1] => Array
        (
            [0] => s:81:"/home/caciki/domains/caciki.com.tr/public_html/wp-content/themes/rafine/style.css";
            [1] => /home/caciki/domains/
            [2] => /public_html/wp-content/themes/rafine/style.css
        )

)

I get all 11 matches only when I tinker with the target file, so something in it must be breaking the pattern or PHP.

I've tested the same pattern in Python and C#, and both give the correct result.

So what's wrong here?

caciki_tr.sql (target file)

Update: The pattern here is used with different substrings (e.g., domain, URL, username). Not all strings in the target file follow the same pattern. For example, a pattern for URLs should be able to match the following:

$url = "http://[DOMAIN_OMITTED]/~caciki";
$pattern = "|s:\d+:\\"([^s]*(?:s(?!:\d)[^s]*)*){$url}(.+?)\\";|";

s:28:"http://[DOMAIN_OMITTED]/~caciki";
s:28:"some page";

In short, there might be no text between the s:28:" and the substring ($url), or after the substring, so both parts should be optional.

Wiktor Stribiżew · Accepted Answer

The current pattern is inefficient because it contains a malformed "tempered greedy token", (?:.(?!s:\d+))+? (the lookahead is placed after the dot rather than before it). This construct checks the lookahead at every single character, so it should be "unrolled" if you want to use such a regex in production.

You may use [^s]*(?:s(?!:\d)[^s]*)* instead:

"|s:\d+:\\"([^s]*(?:s(?!:\d)[^s]*)*)$domain(.+?)\\";|"
            ^^^^^^^^^^^^^^^^^^^^^^^

Details

[^s]* - 0+ chars other than s
(?: - start of a non-capturing group that matches...
    s(?!:\d) - an s not followed by : and a digit
    [^s]* - 0+ chars other than s
)* - ...zero or more times.
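
A minimal sketch of the unrolled pattern in use. It assumes the serialized strings contain plain double quotes, as in the output printed in the question (if the dump escapes them as \", the quote parts of the pattern need adjusting). It also swaps the hand-written escape() for preg_quote() and uses (.*?) instead of (.+?) so that the text after the substring is optional, as the update asks. The helper name build_pattern and the sample lines are made up for illustration.

```php
<?php
// Hypothetical helper: builds the unrolled pattern for any literal substring.
function build_pattern(string $needle): string
{
    // preg_quote() escapes regex metacharacters and the "|" delimiter.
    $quoted = preg_quote($needle, '|');
    // s:<len>:" + unrolled prefix + literal substring + lazy suffix + ";
    return '|s:\d+:"([^s]*(?:s(?!:\d)[^s]*)*)' . $quoted . '(.*?)";|';
}

// Made-up sample data (assumption: unescaped double quotes).
$content = <<<'SQL'
s:81:"/home/caciki/domains/caciki.com.tr/public_html/wp-content/themes/rafine/style.css";
s:9:"some page";
s:13:"caciki.com.tr";
SQL;

preg_match_all(build_pattern('caciki.com.tr'), $content, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    // $m[1] is the text before the domain, $m[2] the text after it.
    printf("before=[%s] after=[%s]\n", $m[1], $m[2]);
}
```

The s(?!:\d) guard is what keeps the prefix from running past the start of the next s:N: entry, so the "some page" line is skipped rather than merged into a later match.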

Note that if you plan to work with big files, you should make sure your patterns are as efficient as possible. For large files, pcregrep is also worth considering, as it is a very fast tool.
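
One way to check whether PHP is cutting the search short is to inspect preg_last_error() after the call. The sketch below assumes, and this is not confirmed in the question, that the missing matches come from PCRE hitting a limit such as pcre.backtrack_limit. The wrapper name match_all_checked and the sample subject are hypothetical.

```php
<?php
// Hypothetical wrapper: runs preg_match_all() and surfaces PCRE errors
// instead of silently carrying on with an incomplete result.
function match_all_checked(string $pattern, string $subject): array
{
    $count = preg_match_all($pattern, $subject, $matches, PREG_SET_ORDER);
    if ($count === false) {
        // The code can be compared against constants such as
        // PREG_BACKTRACK_LIMIT_ERROR or PREG_JIT_STACKLIMIT_ERROR.
        throw new RuntimeException('PCRE error code: ' . preg_last_error());
    }
    return $matches;
}

// Made-up subject, used only to show the call shape.
$matches = match_all_checked('|s:(\d+):"|', 's:28:"some page";');
echo count($matches), " match(es)\n";
```

If an error is reported, raising pcre.backtrack_limit with ini_set() is one option, though making the pattern cheaper is usually the better fix.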
