MrC

Reputation: 381

PHP regular expression optimization

I am trying to optimize a PHP regular expression and am seeking guidance from the wonderful Stack Overflow community.

I am attempting to catch pre-defined matches in an HTML block such as:

##test##

##!test2##

##test3|id=5##

An example text that would run is:

Lorem ipsum dolor sit amet, ##test## consectetur adipiscing elit. Pellentesque id congue massa. Curabitur ##test3|id=5## egestas ullamcorper sollicitudin. Mauris venenatis sed metus vitae pharetra.

I have two options so far. Thoughts on which is best from an optimization standpoint?

Option 1

~##(!?)(test|test2|test3)(|\S+?)##~s

Option 2

~\##(\S+)##~s

The "!" in the example ##!test2## is meant to flag an item for special behavior while it is being processed. This could instead be moved into an attribute, as in ##test3|force=true&id=5##. If that is done, there would be a third option:

Option 3

~##(test|test2|test3)(|\S+?)##~s

The biggest factor that we are looking at is performance and optimization.

Thanks in advance for your help and insight!

Upvotes: 1

Views: 220

Answers (2)

mickmackusa

Reputation: 47992

If you need to dissect and process your matching substrings based on character occurrences, it seems most logical to separate the components during the regex step. Concern yourself with pattern optimization after accuracy and ease of handling are ironed out.

My pattern contains three capture groups; only the middle one requires a positive-length string. Negated character classes are used for pattern efficiency. I am assuming that your substrings will not contain #, which is used to delimit them. If they may contain #, then please update your question and I'll update my answer.

Pattern Demo

Pattern Explanation:

/          // pattern delimiter
##         // match leading substring delimiter
(!)?       // optionally capture: an exclamation mark
([^#|]+)   // greedily capture: one or more non-hash, non-pipe characters
\|?        // optionally match: a pipe
([^#]+)?   // optionally capture: one or more non-hash characters
##         // match trailing substring delimiter
/          // pattern delimiter
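As a quick check of the no-# assumption, the sketch below runs the same pattern against one tag taken from the question and one hypothetical tag whose attribute value contains a #. The color=#fff attribute is invented purely for illustration and does not come from the question.

```php
<?php
$pattern = '/##(!)?([^#|]+)\|?([^#]+)?##/';

// Tag taken from the question: no "#" inside the tag body.
$plain = preg_match($pattern, '##test3|id=5##');

// Hypothetical tag (assumption): the attribute value contains a "#".
$withHash = preg_match($pattern, '##test3|color=#fff##');

var_dump($plain, $withHash);
```

Under these assumptions the first call should report a match and the second should not, so a tag with an embedded # would be left untouched in the text rather than replaced.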

Code: (Demo)

$string='Lorem ipsum dolor sit amet, ##test## consectetur adipiscing elit. Pellentesque id congue massa. Curabitur ##test3|id=5## egestas ullamcorper sollicitudin. Mauris venenatis sed metus ##!test2## vitae pharetra.';

$result=preg_replace_callback(
    '/##(!)?([^#|]+)\|?([^#]+)?##/',
    function($m){
        echo '$m = ';
        var_export($m);
        echo "\n";
        // execute custom processing:
        if(isset($m[1][0])){  // $m[1] is always set (because $m[2] always matches); its first character exists only when "!" was captured
            echo "exclamation found\n";
        }
        // $m[2] is required (will always be set)
        if(isset($m[3])){  // will only be set if there is a positive-length string in it
            echo "post-pipe substring found\n";
        }
        echo "\n---\n";
        return '[some replacement text]';
    },$string);

var_export($result);

Output:

$m = array (
  0 => '##test##',
  1 => '',
  2 => 'test',
)

---
$m = array (
  0 => '##test3|id=5##',
  1 => '',
  2 => 'test3',
  3 => 'id=5',
)
post-pipe substring found

---
$m = array (
  0 => '##!test2##',
  1 => '!',
  2 => 'test2',
)
exclamation found

---
'Lorem ipsum dolor sit amet, [some replacement text] consectetur adipiscing elit. Pellentesque id congue massa. Curabitur [some replacement text] egestas ullamcorper sollicitudin. Mauris venenatis sed metus [some replacement text] vitae pharetra.'

If you are performing custom replacement processes, this method will "optimize" your string handling.
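As a rough sketch of what the custom processing could look like, the helper below splits a single tag into a flag, a name, and an attribute array. It assumes the attributes use query-string syntax, as in the ##test3|force=true&id=5## example from the question; the function name parse_tag() and the array keys are hypothetical, and the pattern is anchored here only because the helper receives one tag at a time.

```php
<?php
// Hypothetical helper (not from the original answer): dissects one tag.
function parse_tag(string $tag): ?array {
    // Same pattern as above, anchored to a single tag.
    if (!preg_match('/^##(!)?([^#|]+)\|?([^#]+)?##$/', $tag, $m)) {
        return null; // not a tag in the expected format
    }

    $attributes = [];
    if (isset($m[3])) {
        // Assumption: attributes are written like a query string (a=1&b=2).
        parse_str($m[3], $attributes);
    }

    return [
        'flagged'    => $m[1] === '!', // true when the "!" marker is present
        'name'       => $m[2],
        'attributes' => $attributes,
    ];
}

var_export(parse_tag('##test3|force=true&id=5##'));
echo "\n";
var_export(parse_tag('##!test2##'));
```

Note that parse_str() returns every value as a string, so something like force=true would still need to be converted to a boolean by the calling code.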

Upvotes: 0

Jan

Reputation: 43169

As others have mentioned, you'll need to time your expressions. Python has the fantastic timeit module, whereas for PHP you need to come up with your own solution:

<?php

$string = <<<DATA
Lorem ipsum dolor sit amet, ##test## consectetur adipiscing elit. Pellentesque id congue massa. Curabitur ##test3|id=5## egestas ullamcorper sollicitudin. Mauris venenatis sed metus vitae pharetra.
DATA;

function timeit($regex, $string, $number) {
    $start = microtime(true);

    for($i=0;$i<$number;$i++) {
        preg_match_all($regex, $string, $matches);
    }

    return microtime(true) - $start;
}

$expressions = ['~##(!?)(test|test2|test3)(|\S+?)##~s', '~\##(\S+)##~s', '~##(test|test2|test3)(|\S+?)##~s'];
$cnt = 1;
foreach ($expressions as $expression) {
    echo "Expression " . $cnt . " took " . timeit($expression, $string, 10**5) . "\n";
    $cnt++;
}
?>


Running this on my computer (100k iterations each) yields:

Expression 1 took 0.45759010314941
Expression 2 took 0.34269499778748
Expression 3 took 0.40994691848755

Obviously, you can play around with other strings and more iterations, but this will give you a general idea.
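Timings are only comparable if the expressions capture the same thing, so it may be worth checking that first. The sketch below prints the capture groups each of the three expressions produces for one tag from the question; the helper name capture_groups() is hypothetical.

```php
<?php
// One tag from the question, used as the test subject.
$tag = '##test3|id=5##';

$expressions = [
    'Option 1' => '~##(!?)(test|test2|test3)(|\S+?)##~s',
    'Option 2' => '~\##(\S+)##~s',
    'Option 3' => '~##(test|test2|test3)(|\S+?)##~s',
];

// Hypothetical helper: returns only the capture groups of the first match.
function capture_groups(string $regex, string $subject): array {
    preg_match($regex, $subject, $m);
    return array_slice($m, 1); // drop the full match at index 0
}

$captures = [];
foreach ($expressions as $label => $regex) {
    $captures[$label] = capture_groups($regex, $tag);
    echo $label . ': ' . json_encode($captures[$label]) . "\n";
}
```

With the alternation written as test|test2|test3, the engine tries test first, so Options 1 and 3 appear to capture "test" as the name and "3|id=5" as the remainder, while Option 2 returns the whole "test3|id=5" in a single group. If that is not the intended split, the pattern needs fixing before its speed is worth measuring.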

Upvotes: 2
