Reputation: 12027
I'm trying to find strings that contain a domain. I have the following pattern:
"|s:\\d+:\\\\\"((?:.(?!s:\\d+))+?){$domain}(.+?)\\\\\";|"
This (pattern) seems to work, but I get only the first two matches in PHP.
$filename = "caciki_tr.sql";
$domain = "caciki.com.tr";
$domain = escape($domain, ".");
$content = file_get_contents($filename);
$pattern = "|s:\\d+:\\\\\"((?:.(?!s:\\d+))+?){$domain}(.+?)\\\\\";|";
preg_match_all($pattern, $content, $matches, PREG_SET_ORDER);
print_r($matches);
function escape($string, $chars) {
    $chars = str_split($chars);
    foreach ($chars as $char) {
        $string = str_replace($char, "\\{$char}", $string);
    }
    return $string;
}
Array
(
[0] => Array
(
[0] => s:121:\"/home/caciki/domains/caciki.com.tr/public_html/wp-content/themes/rafine/woocommerce/single-product/product-thumbnails.php\";
[1] => /home/caciki/domains/
[2] => /public_html/wp-content/themes/rafine/woocommerce/single-product/product-thumbnails.php
)
[1] => Array
(
[0] => s:81:\"/home/caciki/domains/caciki.com.tr/public_html/wp-content/themes/rafine/style.css\";
[1] => /home/caciki/domains/
[2] => /public_html/wp-content/themes/rafine/style.css
)
)
I get all 11 matches only when I tinker with the target file. Something must be breaking the pattern or PHP.
I've tested the same pattern in Python and C#, and both give the correct result.
So what's wrong here?
caciki_tr.sql (target file)
Update: The pattern here is used with different substrings (e.g., domain, URL, username, etc.). Not all strings in the target file follow the same pattern. For example, a pattern for URLs should be able to match the following:
$url = "http://[DOMAIN_OMITTED]/~caciki";
$pattern = "|s:\d+:\\\\\"([^s]*(?:s(?!:\d)[^s]*)*){$url}(.+?)\\\\\";|";
s:28:\"http://[DOMAIN_OMITTED]/~caciki\";
s:28:\"<a href=\"http://[DOMAIN_OMITTED]/~caciki\">some page</a>\";
In short, there might be no text between the s:28:\" prefix and the substring ($url), or after the substring, so both parts should be optional.
Upvotes: 1
Views: 507
Reputation: 626870
The current pattern is rather inefficient because it contains a corrupt "tempered greedy token", (?:.(?!s:\d+))+?. This is a very inefficient construct that should be "unrolled" if you want to use such a regex in production.
You may use [^s]*(?:s(?!:\d)[^s]*)* instead:
"|s:\d+:\\\\\"([^s]*(?:s(?!:\d)[^s]*)*)$domain(.+?)\\\\\";|"
               ^^^^^^^^^^^^^^^^^^^^^^^
Details
[^s]* - 0+ chars other than s
(?: - a non-capturing group repeating...
s(?!:\d) - s not followed with : + a digit
[^s]* - 0+ chars other than s
)* - zero or more times.

Note that if you plan to work with big files, make sure your patterns are as efficient as possible. Also, here is an interesting solution in case you want to work with large files (pcregrep is a very fast tool).
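As a minimal sketch of how the unrolled pattern could be applied, the snippet below runs it against a small inline sample instead of the real caciki_tr.sql (the sample string is an assumption modeled on the output in the question). It swaps the custom escape() helper for preg_quote(), and checks preg_last_error() after the call. Checking the error state is a suggestion, not something the question confirms as the cause: PCRE limits such as the backtrack limit are one possible reason a call stops early on a large input.

```php
<?php
// Hypothetical sample in the same shape as the dump shown in the question.
// Single-quoted PHP strings keep \" as a literal backslash plus quote.
$content = 's:81:\"/home/caciki/domains/caciki.com.tr/public_html/wp-content/themes/rafine/style.css\";'
         . 's:5:\"other\";'
         . 's:46:\"/home/caciki/domains/caciki.com.tr/public_html\";';

// preg_quote() escapes regex metacharacters (here the dots) and,
// via the second argument, the pattern delimiter.
$domain = preg_quote('caciki.com.tr', '|');

// Same unrolled pattern as above, written with single quotes so that
// \\\\ in the source becomes \\ in the regex (one literal backslash).
$pattern = '|s:\d+:\\\\"([^s]*(?:s(?!:\d)[^s]*)*)' . $domain . '(.+?)\\\\";|';

$count = preg_match_all($pattern, $content, $matches, PREG_SET_ORDER);

// A false return or a non-zero error code means PCRE gave up,
// for example with PREG_BACKTRACK_LIMIT_ERROR.
if ($count === false || preg_last_error() !== PREG_NO_ERROR) {
    echo 'PCRE error code: ', preg_last_error(), PHP_EOL;
} else {
    foreach ($matches as $m) {
        // $m[1] is the text before the domain, $m[2] the text after it.
        echo $m[1], ' | ', $m[2], PHP_EOL;
    }
}
```

With this sample, the middle entry (s:5:\"other\";) is skipped because the unrolled group cannot run past the next s:<digit> marker, so only the two entries containing the domain are reported.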
Upvotes: 1