Meep3D
Meep3D

Reputation: 3756

PHP regex crashing apache

I've got a regex that does matching for a template system, which unfortunately seems to crash apache (it's running on Windows) on some modestly-trivial lookups. I've researched the issue and there are a few suggestions for upping stack size etc, none of which seem to work and I don't really like dealing with such issues by upping limits anyway as it generally just pushed the bug into the future.

Anyway any ideas on how to alter the regex to make it less likely to foul up?

The idea is to catch the innermost block (in this case {block:test}This should be caught first!{/block:test}) which I'll then str_replace out the starting/ending tags and re-run the whole thing through the regex until there are no blocks left.

Regex:

~(?P<opening>{(?P<inverse>[!])?block:(?P<name>[a-z0-9\s_-]+)})(?P<contents>(?:(?!{/?block:[0-9a-z-_]+}).)*)(?P<closing>{/block:\3})~ism

Sample template:

<div class="f_sponsors s_banners">
    <div class="s_previous">&laquo;</div>
    <div class="s_sponsors">
        <ul>
            {block:sponsors}
            <li>
                <a href="{var:url}" target="_blank">
                    <img src="image/160x126/{var:image}" alt="{var:name}" title="{var:name}" />
                </a>
            {block:test}This should be caught first!{/block:test}
            </li>
            {/block:sponsors}
        </ul>
    </div>
    <div class="s_next">&raquo;</div>
</div>

It's a long shot I suppose. :(

Upvotes: 9

Views: 532

Answers (3)

godspeedlee
godspeedlee

Reputation: 672

You could use atomic group: (?>...) or possessive quantifiers: ?+ *+ ++.. to suppress/limit the backtracking and speed up matching by unrolling loop technique. My solution:

\{block:(\w++)\}([^<{]++(?:(?!\{\/?block:\1\b)[<{][^<{]*+)*+)\{/block:\1\}

I have tested from http://regexr.com?31p03.

match {block:sponsors}...{/block:sponsors}:
\{block:(sponsors)\}([^<{]++(?:(?!\{\/?block:\1\b)[<{][^<{]*+)*+)\{/block:\1\}
http://regexr.com?31rb3

match {block:test}...{/block:test}:
\{block:(test)\}([^<{]++(?:(?!\{\/?block:\1\b)[<{][^<{]*+)*+)\{/block:\1\}
http://regexr.com?31rb6

the other solution:
in PCRE source code, you can remove the comment from config.h:
/* #undef NO_RECURSE */

following text copy from config.h:
PCRE uses recursive function calls to handle backtracking while matching. This can sometimes be a problem on systems that have stacks of limited size. Define NO_RECURSE to get a version that doesn't use recursion in the match() function; instead it creates its own stack by steam using pcre_recurse_malloc() to obtain memory from the heap.

or you could change pcre.backtrack_limit and pcre.recursion_limit from php.ini (http://www.php.net/manual/en/pcre.configuration.php)

Upvotes: 4

Ian Roberts
Ian Roberts

Reputation: 122414

Does the solution have to be a single regular expression? A more efficient approach might be simply to look for the first occurrence of {/block: (which could be a simple string search or a regex) and then search backwards from that point to find its matching opening tag, replace the span appropriately and repeat until there are no more blocks. If every time round you look for the first closing tag starting from the top of the template then that will give you the most deeply nested block.

The mirror image algorithm would work just as well - look for the last opening tag and then search forward from there for the corresponding closing tag:

<?php

$template = //...

while(true) {
  $last_open_tag = strrpos($template, '{block:');
  $last_inverted_tag = strrpos($template, '{!block:');
  // $block_start is the index of the '{' of the last opening block tag in the
  // template, or false if there are no more block tags left
  $block_start = max($last_open_tag, $last_inverted_tag);
  if($block_start === false) {
    // all done
    break;
  } else {
    // extract the block name (the foo in {block:foo}) - from the character
    // after the next : to the character before the next }, inclusive
    $block_name_start = strpos($template, ':', $block_start) + 1;
    $block_name = substr($template, $block_name_start,
        strcspn($template, '}', $block_name_start));

    // we now have the start tag and the block name, next find the end tag.
    // $block_end is the index of the '{' of the next closing block tag after
    // $block_start.  If this doesn't match the opening tag something is wrong.
    $block_end = strpos($template, '{/block:', $block_start);
    if(strpos($template, $block_name.'}', $block_end + 8) !== $block_end + 8) {
      // non-matching tag
      print("Non-matching tag found\n");
      break;
    } else {
      // now we have found the innermost block
      // - its start tag begins at $block_start
      // - its content begins at
      //   (strpos($template, '}', $block_start) + 1)
      // - its content ends at $block_end
      // - its end tag ends at ($block_end + strlen($block_name) + 9)
      //   [9 being the length of '{/block:' plus '}']
      // - the start tag was inverted iff $block_start === $last_inverted_tag
      $template = // do whatever you need to do to replace the template
    }
  }
}

echo $template;

Upvotes: 4

Alan Moore
Alan Moore

Reputation: 75252

Try this one:

'~(?P<opening>\{(?P<inverse>[!])?block:(?P<name>[a-z0-9\s_-]+)\})(?P<contents>[^{]*(?:\{(?!/block:(?P=name)\})[^{]*)*)(?P<closing>\{/block:(?P=name)\})~i'

Or, in readable form:

'~(?P<opening>
  \{
  (?P<inverse>[!])?
  block:
  (?P<name>[a-z0-9\s_-]+)
  \}
)
(?P<contents>
  [^{]*(?:\{(?!/block:(?P=name)\})[^{]*)*
)
(?P<closing>
  \{
  /block:(?P=name)
  \}
)~ix'

The most important part is in the (?P<contents>..) group:

[^{]*(?:\{(?!/block:(?P=name)\})[^{]*)*

Starting out, the only character we're interested in is the opening brace, so we can slurp up any other characters with [^{]*. Only after we see a { do we check to see if it's the beginning of a {/block} tag. If it isn't, we go ahead and consume it and start scanning for the next one, and repeat as necessary.

Using RegexBuddy, I tested each regex by placing the cursor at the beginning of the {block:sponsors} tag and debugging. Then I removed the ending brace from the closing {/block:sponsors} tag to force a failed match and debugged it again. Your regex took 940 step to succeed and 2265 steps to fail. Mine took 57 steps to succeed and 83 steps to fail.

On a side note, I removed the s modifier because because I'm not using the dot (.), and the m modifier because it never was needed. I also used the named backreference (?P=name) instead of \3 as per @DaveRandom's excellent suggestion. And I escaped all the braces ({ and }) because I find it easier to read that way.


EDIT: If you want to match the innermost named block, change the middle portion of the regex from this:

(?P<contents>
  [^{]*(?:\{(?!/block:(?P=name)\})[^{]*)*
)

...to this (as suggested by @Kobi in his comment):

(?P<contents>
  [^{]*(?:\{(?!/?block:[a-z0-9\s_-]+\})[^{]*)*
)

Originally, the (?P<opening>...) group would grab the first opening tag it saw, then the (?P<contents>..) group would consume anything--including other tags--as long as they weren't the closing tag to match the one found by the (?P<opening>...) group. (Then the (?P<closing>...) group would go ahead and consume that.)

Now, the (?P<contents>...) group refuses to match any tag, opening or closing (note the /? at the beginning), no matter what the name is. So the regex initially starts to match the {block:sponsors} tag, but when it encounters the {block:test} tag, it abandons that match and goes back to searching for an opening tag. It starts again at the {block:test} tag, this time successfully completing the match when it finds the {/block:test} closing tag.

It sounds inefficient describing it like this, but it's really not. The trick I described earlier, slurping up the non-braces, drowns out the effect of these false starts. Where you were doing a negative lookahead at almost every position, now you're doing one only when you encounter a {. You could even use possessive quantifiers, as as @godspeedlee suggested:

(?P<contents>
  [^{]*+(?:\{(?!/?block:[a-z0-9\s_-]+\})[^{]*+)*+
)

...because you know it will never consume anything that it will have to give back later. That would speed things up a little, but it isn't really necessary.

Upvotes: 4

Related Questions