Reputation: 381
I am trying to optimize a PHP regular expression and am seeking guidance from the wonderful Stack Overflow community.
I am attempting to catch pre-defined matches in an HTML block such as:
##test##
##!test2##
##test3|id=5##
An example text that would run is:
Lorem ipsum dolor sit amet, ##test## consectetur adipiscing elit. Pellentesque id congue massa. Curabitur ##test3|id=5## egestas ullamcorper sollicitudin. Mauris venenatis sed metus vitae pharetra.
I have two options so far. Thoughts on which is best from an optimization standpoint?
Option 1
~##(!?)(test|test2|test3)(|\S+?)##~s
Option 2
~\##(\S+)##~s
The "!" in the example ##!test2## is meant to flag an item for special behavior while it is being processed. This could instead be moved into an attribute, like ##test3|force=true&id=5##. If this is the case, there'd be:
Option 3
~##(test|test2|test3)(|\S+?)##~s
The biggest factor that we are looking at is performance and optimization.
Thanks in advance for your help and insight!
Upvotes: 1
Views: 220
Reputation: 47992
If you need to dissect and process your matching substrings based on character occurrences, it seems most logical to separate the components during the regex step. Concern yourself with pattern optimization after accuracy and ease of handling are ironed out.
My pattern contains three capture groups; only the middle one requires a positive-length string. Negated character classes are used for pattern efficiency. I am assuming that your substrings will not contain #, which is used to delimit the substrings. If they may contain #, then please update your question and I'll update my answer.
Pattern Explanation:
/ // pattern delimiter
## // match leading substring delimiter
(!)? // optionally capture: an exclamation mark
([^#|]+) // greedily capture: one or more non-hash, non-pipe characters
\|? // optionally match: a pipe
([^#]+)? // optionally capture: one or more non-hash characters
## // match trailing substring delimiter
/ // pattern delimiter
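As a quick check on the accuracy point, the sketch below compares what the question's Option 1 and the pattern explained above capture for ##!test2##. PCRE tries alternatives left to right, so in Option 1 `test` succeeds before `test2` is ever tried and the trailing `2` ends up in the third group. The variable names are illustrative only.

```php
<?php
$token = '##!test2##';

// Option 1 from the question: the alternation lists "test" first.
preg_match('~##(!?)(test|test2|test3)(|\S+?)##~s', $token, $option1);

// Pattern explained above: the name is everything up to a pipe or a hash.
preg_match('/##(!)?([^#|]+)\|?([^#]+)?##/', $token, $negated);

// Show only the capture groups (index 0 is the full match).
var_export([
    'option1' => array_slice($option1, 1), // '!', 'test', '2'
    'negated' => array_slice($negated, 1), // '!', 'test2'
]);
```

If the alternation is kept, listing the longer names first (`test2|test3|test`) avoids the split.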
Code:
$string='Lorem ipsum dolor sit amet, ##test## consectetur adipiscing elit. Pellentesque id congue massa. Curabitur ##test3|id=5## egestas ullamcorper sollicitudin. Mauris venenatis sed metus ##!test2## vitae pharetra.';
$result=preg_replace_callback(
    '/##(!)?([^#|]+)\|?([^#]+)?##/',
    function($m){
        echo '$m = ';
        var_export($m);
        echo "\n";
        // execute custom processing:
        if(isset($m[1][0])){ // check first character of element (element will always be set because $m[2] will always be set)
            echo "exclamation found\n";
        }
        // $m[2] is required (will always be set)
        if(isset($m[3])){ // will only be set if there is a positive-length string in it
            echo "post-pipe substring found\n";
        }
        echo "\n---\n";
        return '[some replacement text]';
    },$string);
var_export($result);
Output:
$m = array (
0 => '##test##',
1 => '',
2 => 'test',
)
---
$m = array (
0 => '##test3|id=5##',
1 => '',
2 => 'test3',
3 => 'id=5',
)
post-pipe substring found
---
$m = array (
0 => '##!test2##',
1 => '!',
2 => 'test2',
)
exclamation found
---
'Lorem ipsum dolor sit amet, [some replacement text] consectetur adipiscing elit. Pellentesque id congue massa. Curabitur [some replacement text] egestas ullamcorper sollicitudin. Mauris venenatis sed metus [some replacement text] vitae pharetra.'
If you are performing custom replacement processes, this method will "optimize" your string handling.
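As a rough sketch of what such custom processing might look like, the callback below looks up each captured name in a whitelist and parses the post-pipe part with parse_str(), assuming attributes are written like a query string (as in the question's ##test3|force=true&id=5## example). The $handlers table, the renderTokens() function, and the replacement strings are hypothetical and not part of the original code.

```php
<?php
// Hypothetical whitelist: token name => callable producing the replacement text.
$handlers = [
    'test'  => function (array $attrs, bool $flagged): string { return '[test]'; },
    'test2' => function (array $attrs, bool $flagged): string { return $flagged ? '[test2!]' : '[test2]'; },
    'test3' => function (array $attrs, bool $flagged): string { return '[test3 id=' . ($attrs['id'] ?? '?') . ']'; },
];

// Returns null only if PCRE itself reports an error.
function renderTokens(string $html, array $handlers): ?string
{
    return preg_replace_callback(
        '/##(!)?([^#|]+)\|?([^#]+)?##/',
        function (array $m) use ($handlers): string {
            $name = $m[2];
            if (!isset($handlers[$name])) {
                return $m[0]; // unknown token: leave the text untouched
            }
            $attrs = [];
            if (isset($m[3])) {
                parse_str($m[3], $attrs); // e.g. "force=true&id=5" => ['force' => 'true', 'id' => '5']
            }
            return $handlers[$name]($attrs, $m[1] === '!');
        },
        $html
    );
}

echo renderTokens('a ##test## b ##!test2## c ##test3|id=5## d ##other##', $handlers), "\n";
```

Keeping the whitelist outside the pattern means new token names can be added without touching the regex.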
Upvotes: 0
Reputation: 43169
As others have mentioned, you'll need to time your expressions. Python has the fantastic timeit module, while for PHP you need to come up with your own solution:
<?php
$string = <<<DATA
Lorem ipsum dolor sit amet, ##test## consectetur adipiscing elit. Pellentesque id congue massa. Curabitur ##test3|id=5## egestas ullamcorper sollicitudin. Mauris venenatis sed metus vitae pharetra.
DATA;

// Run $regex against $string $number times and return the elapsed seconds.
function timeit($regex, $string, $number) {
    $start = microtime(true);
    for($i=0;$i<$number;$i++) {
        preg_match_all($regex, $string, $matches);
    }
    return microtime(true) - $start;
}

// Options 1, 2 and 3 from the question, in that order.
$expressions = [
    '~##(!?)(test|test2|test3)(|\S+?)##~s',
    '~\##(\S+)##~s',
    '~##(test|test2|test3)(|\S+?)##~s',
];
$cnt = 1;
foreach ($expressions as $expression) {
    echo "Expression " . $cnt . " took " . timeit($expression, $string, 10**5) . "\n";
    $cnt++;
}
?>
Expression 1 took 0.45759010314941
Expression 2 took 0.34269499778748
Expression 3 took 0.40994691848755
Obviously, you can play around with other strings and more iterations but this will give you a general idea.
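One way to play around with this is to repeat each measurement several times and keep the fastest run, which reduces noise from other processes. The sketch below assumes PHP 7.3+ for hrtime(); bestOf(), its parameters, and the shortened sample string are hypothetical, and the timings will differ on every machine.

```php
<?php
// Time $iterations runs of preg_match_all, repeat that $repeats times,
// and return the fastest repeat in seconds.
function bestOf(string $regex, string $subject, int $iterations, int $repeats = 5): float
{
    $best = INF;
    for ($r = 0; $r < $repeats; $r++) {
        $start = hrtime(true); // monotonic clock, in nanoseconds
        for ($i = 0; $i < $iterations; $i++) {
            preg_match_all($regex, $subject, $matches);
        }
        $best = min($best, (hrtime(true) - $start) / 1e9);
    }
    return $best;
}

$subject = 'Lorem ipsum ##test## dolor ##!test2## sit ##test3|id=5## amet.';
$patterns = [
    'Option 1' => '~##(!?)(test|test2|test3)(|\S+?)##~s',
    'Option 2' => '~\##(\S+)##~s',
    'Option 3' => '~##(test|test2|test3)(|\S+?)##~s',
];

$timings = [];
foreach ($patterns as $label => $regex) {
    $timings[$label] = bestOf($regex, $subject, 10000);
    printf("%s: %.4f s\n", $label, $timings[$label]);
}
```

Taking the minimum rather than the mean is a design choice: the fastest run is the one least disturbed by whatever else the machine was doing.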
Upvotes: 2