Kristian Rafteseth

Reputation: 2032

How to work around PHP lookbehind fixed width limitation?

I ran into a problem when trying to match all numbers found between specific words on my page. How would you match all the numbers in the following text, but only those between the words "begin" and "end"?

11
a
b
13
begin
t
899
y
50
f
end
91
h

This works:

preg_match("/begin(.*?)end/s", $text, $out);
preg_match_all("/[0-9]{1,}/", $out[1], $result);

But can it be done in one expression?

I tried this, but it doesn't do the trick:

preg_match_all("/begin.*([0-9]{1,}).*end/s", $text, $out);

Upvotes: 7

Views: 453

Answers (3)

mickmackusa

Reputation: 47991

Assuming your project data only has one begin and end "marker" in the text, you can build a more direct and efficient pattern...

Code: (PHP Demo) (Pattern Demo)

$text = "11
a
b
13
begin
t
899
y
50
f
end
91
h";
var_export(preg_match_all('~(?:begin|\G(?!^))(?:(?!end)\D)+\K\d+~s', $text, $out) ? $out[0] : 'no matches');

Output:

array (
  0 => '899',
  1 => '50',
)

Layman's Breakdown:

(?:begin|\G(?!^))  #match "begin" or continue matching from the position immediately after previous match

(?:(?!end)\D)+     #match one or more non-digit characters, checking before each one that "end" does not start there.  Once "end" is reached, this match attempt fails and no further digits are collected.

\K                 #restart the fullstring match from this position; this avoids the expense of using a capture group on the desired digits

\d+                #match one or more digits (as much as possible)

See the Pattern Demo link for a more academic breakdown of the pattern.
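
To make the effect of \K concrete, here is a small comparison sketch. It runs the pattern above next to a variant that uses a capture group instead of \K. The variable names ($withK, $withGroup) are only for illustration.

```php
<?php
$text = "11\na\nb\n13\nbegin\nt\n899\ny\n50\nf\nend\n91\nh";

// With \K, the digits are the whole match, so they land in index 0.
preg_match_all('~(?:begin|\G(?!^))(?:(?!end)\D)+\K\d+~s', $text, $withK);

// Without \K, the whole match also contains the leading text,
// and the digits have to be read from capture group 1.
preg_match_all('~(?:begin|\G(?!^))(?:(?!end)\D)+(\d+)~s', $text, $withGroup);

var_export($withK[0]);     // the digits only
var_export($withGroup[0]); // "begin\nt\n899" and "\ny\n50"
var_export($withGroup[1]); // the digits only
```

Both variants find the same numbers; the \K version simply avoids the extra subarray.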

Upvotes: 0

Stephan

Reputation: 43053

Ideal solution

What is really needed here is a positive lookbehind with variable width. The regex would end up like this:

~(?<=begin.*)\d+(?=.*end)~s

However, as of this writing, the PHP regex flavor doesn't support this feature. Only fixed-width lookbehind is supported. (The .NET flavor does support it, though.)
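
A quick way to confirm the limitation on your own setup is to try compiling the pattern. In this sketch the @ operator is there only to silence the compile warning for the demo; preg_match_all() returns false when the pattern cannot be compiled.

```php
<?php
$text = "begin 899 end";

// The unbounded .* inside (?<=...) makes the lookbehind variable width,
// so PCRE rejects the pattern at compile time and PHP returns false.
$result = @preg_match_all('~(?<=begin.*)\d+(?=.*end)~s', $text, $out);

var_dump($result); // bool(false)
```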

Workaround

To achieve our goal, we can use preg_replace_callback with the following regex:

~(?<token>begin|end)|(?<number>\d+)|.*?~s

Sample code

function extract_number($input) {
  $in_region = false;  // true while we are between "begin" and "end"
  $numbers = [];

  // A closure avoids redeclaring a named function, and stale static
  // state, when extract_number() is called more than once.
  preg_replace_callback(
    '~(?<token>begin|end)|(?<number>\d+)|.*?~s',
    function ($match) use (&$in_region, &$numbers) {
      $token = $match['token'] ?? '';
      if ($token !== '') {
        $in_region = ($token === 'begin');
      } elseif ($in_region && isset($match['number'])) {
        $numbers[] = $match['number'];
      }
      return '';
    },
    $input
  );

  return $numbers;
}

// $str is assumed to hold the OP's sample text
echo '<pre>';
var_dump(extract_number($str));
echo '</pre>';

Output (with OP's example)

array(2) {
  [0]=>
  string(3) "899"
  [1]=>
  string(2) "50"
}
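
The same state-machine idea can also be sketched without a replacement step. The version below is a variation rather than the approach shown in this answer: numbers_between() is a hypothetical name, and it walks the tokens returned by preg_match_all() with PREG_SET_ORDER, so no state is shared between calls.

```php
<?php
// Hypothetical helper: collects digit runs that appear after a "begin"
// token and before the next "end" token.
function numbers_between(string $input): array {
    preg_match_all('~(?<token>begin|end)|(?<number>\d+)~', $input, $matches, PREG_SET_ORDER);

    $inRegion = false;
    $numbers = [];
    foreach ($matches as $m) {
        $token = $m['token'] ?? '';
        if ($token === 'begin') {
            $inRegion = true;
        } elseif ($token === 'end') {
            $inRegion = false;
        } elseif ($inRegion && ($m['number'] ?? '') !== '') {
            $numbers[] = $m['number'];
        }
    }
    return $numbers;
}

var_export(numbers_between("11\na\nb\n13\nbegin\nt\n899\ny\n50\nf\nend\n91\nh"));
```

Because the tokens are processed in order, several begin/end regions in one text are handled as well: for "1 begin 2 end 3 begin 4 end 5" the function returns 2 and 4.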

Upvotes: 0

Jerry

Reputation: 71578

You can make use of the \G anchor like this, and some lookaheads to make sure that you're not going 'out of territory' (out of the area between the two words):

(?:begin|(?!^)\G)(?:(?=(?:(?!begin).)*end)\D)*?(\d+)

regex101 demo

(?:                  # Begin of first non-capture group
  begin              # Match 'begin'
|                    # Or
  (?!^)\G            # Start the match from the previous end of match
)                    # End of first non-capture group
(?:                  # Second non-capture group
  (?=                # Positive lookahead
    (?:(?!begin).)*  # Negative lookahead to prevent running into another 'begin'
    end              # And make sure that there's an 'end' ahead
  )                  # End positive lookahead
  \D                 # Match non-digits
)*?                  # Second non-capture group repeated many times, lazily
(\d+)                # Capture digits
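
For completeness, a minimal PHP sketch of how this pattern could be used on the sample text from the question. The ~ delimiters and the s flag are assumptions added here: since the text spans several lines, the dot in the lookahead has to match newlines in order to reach "end".

```php
<?php
$text = "11\na\nb\n13\nbegin\nt\n899\ny\n50\nf\nend\n91\nh";

// Same pattern as above, wrapped in ~ delimiters with the s (dotall) flag.
$pattern = '~(?:begin|(?!^)\G)(?:(?=(?:(?!begin).)*end)\D)*?(\d+)~s';

preg_match_all($pattern, $text, $out);

// The numbers are in capture group 1; $out[0] holds the full matches,
// including the non-digit text that precedes each number.
var_export($out[1]);
```

Without the s flag, the lookahead cannot see past the line break after "begin", and nothing is matched for this input.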

Upvotes: 7
