Reputation: 2032
I ran into a problem when trying to match all numbers found between specific words on my page. How would you match all the numbers in the following text, but only those between the words "begin" and "end"?
11
a
b
13
begin
t
899
y
50
f
end
91
h
This works:
preg_match("/begin(.*?)end/s", $text, $out);
preg_match_all("/[0-9]{1,}/", $out[1], $result);
But can it be done in one expression?
I tried this, but it doesn't do the trick:
preg_match_all("/begin.*([0-9]{1,}).*end/s", $text, $out);
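For reference, here is a minimal sketch that runs both snippets from above on the sample text. The variable names ($region, $twoStep, $attempt) are only illustrative. It suggests why the single pattern falls short: the greedy .* runs to the end of the text and backtracks only as far as it has to, so the capture group is left with a single trailing digit.

```php
<?php
$text = "11\na\nb\n13\nbegin\nt\n899\ny\n50\nf\nend\n91\nh";

// Two-step approach: isolate the region first, then collect the digits in it.
preg_match("/begin(.*?)end/s", $text, $region);
preg_match_all("/[0-9]{1,}/", $region[1], $twoStep);

// Single-pattern attempt: the greedy .* swallows everything up to the last
// digit that can still be followed by "end", so only that digit is captured.
preg_match_all("/begin.*([0-9]{1,}).*end/s", $text, $attempt);

var_export($twoStep[0]);
echo "\n";
var_export($attempt[1]);
```

On the sample, the two-step version yields 899 and 50, while the single-pattern attempt captures only the final 0 of 50.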
Upvotes: 7
Views: 453
Reputation: 47991
Assuming your project data only has one begin and end "marker" in the text, you can build a more direct and efficient pattern...
Code: (PHP Demo) (Pattern Demo)
$text = "11
a
b
13
begin
t
899
y
50
f
end
91
h";
var_export(preg_match_all('~(?:begin|\G(?!^))(?:(?!end)\D)+\K\d+~s', $text, $out) ? $out[0] : 'no matches');
Output:
array (
0 => '899',
1 => '50',
)
Layman's Breakdown:
(?:begin|\G(?!^)) #match "begin" or continue matching from the position immediately after previous match
(?:(?!end)\D)+      #match one or more non-digit characters, checking before each one that "end" does not start there. Once "end" is reached, the attempt fails and the \G chain stops.
\K #restart the fullstring match from this position; this avoids the expense of using a capture group on the desired digits
\d+ #match one or more digits (as much as possible)
See the Pattern Demo link for a more academic breakdown of the pattern.
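The single-marker assumption matters, and a small sketch may make it concrete. The helper name numbers_after_begin is hypothetical and simply wraps the pattern above. The extra inputs are made up to probe two edge cases: a text with no "end" at all, and a text with two begin/end regions.

```php
<?php
// Hypothetical helper: wraps the \G-based pattern from this answer.
function numbers_after_begin(string $text): array
{
    $pattern = '~(?:begin|\G(?!^))(?:(?!end)\D)+\K\d+~s';
    // preg_match_all returns 0 (or false on error) when nothing matches.
    return preg_match_all($pattern, $text, $out) ? $out[0] : [];
}

$sample = "11\na\nb\n13\nbegin\nt\n899\ny\n50\nf\nend\n91\nh";

$fromSample = numbers_after_begin($sample);
// The pattern never requires "end" to exist, so every number after "begin" matches.
$noEnd = numbers_after_begin("begin 1 2 3");
// "end" breaks the \G chain; a later "begin" starts a new one.
$twoRegions = numbers_after_begin("begin 1 end 2 begin 3 end");

var_export([$fromSample, $noEnd, $twoRegions]);
```

If a missing "end" should produce no matches at all, the pattern needs an extra lookahead that checks for "end" further on.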
Upvotes: 0
Reputation: 43053
What is really needed here is a positive lookbehind with variable width. The regex would end up like this:
~(?<=begin.*)\d+(?=.*end)~s
However, as of this writing, the PHP regex flavor doesn't support this feature. Only fixed-width lookbehind is supported. (The .NET flavor does support it, though.)
To achieve our goal, we can use preg_replace_callback with the following regex:
~(?<token>begin|end)|(?<number>\d+)|.*?~s
function extract_number($input) {
    $in_region = false; // local state, so every call starts outside the region
    $numbers = [];
    preg_replace_callback(
        '~(?<token>begin|end)|(?<number>\d+)|.*?~s',
        function ($match) use (&$in_region, &$numbers) {
            // groups that did not take part in the match may be absent, hence ??
            $token = $match['token'] ?? '';
            $number = $match['number'] ?? '';
            if ($token === 'begin') {
                $in_region = true;
            } elseif ($token === 'end') {
                $in_region = false;
            } elseif ($in_region && $number !== '') {
                $numbers[] = $number; // collect directly, which also keeps a literal "0"
            }
            return '';
        },
        $input
    );
    return $numbers;
}
echo '<pre>';
var_dump(extract_number($text)); // $text holds the sample input from the question
echo '</pre>';
array(2) {
[0]=>
string(3) "899"
[1]=>
string(2) "50"
}
Upvotes: 0
Reputation: 71578
You can make use of the \G anchor like this, and some lookaheads to make sure that you're not going 'out of territory' (out of the area between the two words):
(?:begin|(?!^)\G)(?:(?=(?:(?!begin).)*end)\D)*?(\d+)
(?:                    # Begin of first non-capture group
  begin                # Match 'begin'
|                      # Or
  (?!^)\G              # Start the match from the previous end of match
)                      # End of first non-capture group
(?:                    # Second non-capture group
  (?=                  # Positive lookahead
    (?:(?!begin).)*    # Negative lookahead to prevent running into another 'begin'
    end                # And make sure that there's an 'end' ahead
  )                    # End positive lookahead
  \D                   # Match non-digits
)*?                    # Second non-capture group repeated many times, lazily
(\d+)                  # Capture digits
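A minimal PHP sketch of how this pattern might be applied to the sample text from the question. The ~ delimiters and the s modifier are my additions, since the answer shows only the bare pattern. Without s, the dot in the lookahead cannot cross newlines, so the check for an upcoming "end" would fail on multi-line input.

```php
<?php
// Assumption: delimiters and the s modifier are added for use with preg_match_all.
$pattern = '~(?:begin|(?!^)\G)(?:(?=(?:(?!begin).)*end)\D)*?(\d+)~s';

$sample = "11\na\nb\n13\nbegin\nt\n899\ny\n50\nf\nend\n91\nh";

// The numbers are in capture group 1, so read $out[1] rather than $out[0].
preg_match_all($pattern, $sample, $out);
var_export($out[1]);

// With no closing "end", the lookahead fails and nothing is captured.
preg_match_all($pattern, "begin 1 2 3", $noEnd);
var_export($noEnd[1]);
```

Because the lookahead rescans toward "end" before every non-digit character, this version is stricter about the closing marker but may do more work on long inputs.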
Upvotes: 7