Reputation: 3007
I'm parsing some big log files and have some very simple string matches, for example:
if(m/Some String Pattern/o){
#Do something
}
It seems simple enough, but in fact most of my matches could be anchored to the start of the line, at the cost of a "longer" pattern, for example:
if(m/^Initial static string that matches Some String Pattern/o){
#Do something
}
Obviously this is a longer regular expression and so more work to match. However, I can use the start-of-line anchor, which would allow a failed match to be discarded sooner.
It is my hunch that the latter would be more efficient. Can anyone back me up/shoot me down :-)
Upvotes: 5
Views: 3609
Reputation: 103754
You can gain tremendous insight into what the regex engine is doing in Perl with the use re 'debug' pragma. It is documented in perldoc re.
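As a minimal sketch of what that looks like in practice (the sample line and the debug_match helper are made up for illustration, not taken from the question's real logs), you can wrap a single match in the pragma and read the trace it writes to STDERR:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample line; substitute a real line from your own log.
my $line = "Initial static string that matches Some String Pattern";

# Hypothetical helper: returns 1 on a match, 0 otherwise.
sub debug_match {
    my ($text) = @_;
    # The pragma is lexically scoped, so only the regex inside this
    # sub is traced. Compile and execution details go to STDERR.
    use re 'debug';
    return $text =~ /^Initial static string/ ? 1 : 0;
}

print debug_match($line) ? "matched\n" : "no match\n";
```

Running it with 2>trace.txt keeps the trace separate from normal output. Swap in the unanchored pattern and compare the two traces to see how the engine treats each one.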
It is always helpful to review Perl's suggested performance techniques (perldoc perlperf), including the suggested timing methods.
If I run this small test:
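#!/usr/bin/perl
use strict;
use warnings;
use Benchmark;

my $target = "aeiou";
my $str    = "lkdjflzdjfljdsflkjasdjf asldkfj lasdjf dslfj sldfj asld alskdfj lasd f";
my $str2   = $str . $target;    # the target sits at the very end of the line

timethese(10_000_000, {
    # unanchored: the literal may match anywhere in the line
    'float' => sub {
        die "no match" unless $str2 =~ m/$target/o;
    },
    # anchored, but with a greedy wildcard before the literal
    'anchored' => sub {
        die "no match" unless $str2 =~ m/^.*$target/o;
    },
    # anchored, with the whole static prefix spelled out
    'prefixed' => sub {
        die "no match" unless $str2 =~ m/^$str$target/o;
    },
});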
I get the output of:
Benchmark: timing 10000000 iterations of anchored, float, prefixed...
anchored: 4 wallclock secs ( 3.46 usr + 0.01 sys = 3.47 CPU) @ 2881844.38/s
float: 2 wallclock secs ( 1.87 usr + 0.00 sys = 1.87 CPU) @ 5347593.58/s
prefixed: 4 wallclock secs ( 3.05 usr + 0.01 sys = 3.06 CPU) @ 3267973.86/s
Which leads to the conclusion that, for this input, the non-anchored (floating) version is roughly 1.6 to 1.9 times as fast as the anchored ones. However, a different regex or different source data may change that. YMMV, so test, test, test...
Upvotes: 2
Reputation: 75222
Are you saying you can anchor the regex by adding a static prefix, like this?
/^blah blah The Real Regex/
That certainly won't hurt performance, and it will probably help, but not for the reason you think. Although they're best known for the "magical" stuff like anchors and lookarounds and capturing groups, what regex engines are best at is matching literal sequences of characters. The longer the sequence, the faster the match (up to a point, of course).
In other words, it's the addition of the static prefix, not the anchor, that's giving you the boost.
Upvotes: 1
Reputation: 3007
I did some timings as recommended. Here are the results for my app. They cover the whole app, not just the regex searches. It scans 60,000 lines with 11 regular expressions. The short patterns average about 30 characters; the longer, anchored ones are about 120.
Short
real 0m58.780s
user 0m54.940s
sys 0m0.790s
Long (anchored)
real 0m54.260s
user 0m53.630s
sys 0m0.490s
Long (not anchored)
real 0m54.705s
user 0m54.130s
sys 0m0.400s
So anchoring the long strings is slightly faster, although not by much. It would appear that if my strings were any larger it might be a different matter.
Upvotes: 3
Reputation: 881243
Speed of an RE depends on two factors, the RE itself and the data being passed through the RE. In general, an anchored RE (start or end) with no backtracking will be faster than others. But if you're processing a file where every line is empty, there's no speed difference between /^hello/ and /hello/ (at least if the RE engine is written correctly).
But the rule I follow is: measure, don't guess.
Upvotes: 3
Reputation: 30225
The line anchor makes it faster. I have to add, though, that the //o modifier is not necessary here; in fact, it does nothing, because the patterns interpolate no variables. That's a code smell to me.
There used to be valid uses for //o, but these days qr// covers them.
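A rough sketch of the qr// approach, assuming a small table of patterns like the two in the question (the rule names and the classify helper are invented for illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Each pattern is compiled once by qr// when this list is built.
# The rule names are hypothetical labels, not from the question.
my @rules = (
    [ start => qr/^Initial static string that matches Some String Pattern/ ],
    [ any   => qr/Some String Pattern/ ],
);

# Return the name of the first rule that matches, or undef if none do.
sub classify {
    my ($line) = @_;
    for my $rule (@rules) {
        my ($name, $re) = @$rule;
        return $name if $line =~ $re;
    }
    return;
}

my $kind = classify("xx Some String Pattern yy") // 'none';
print "$kind\n";
```

The compiled regex objects can be stored, passed around, and reused in a loop over the log lines without any //o.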
Upvotes: 4
Reputation: 1371
I vote for the one anchored at the beginning for exactly the reason you state!
Upvotes: 0
Reputation: 992857
I think you'll find that starting your regex with ^ will definitely be faster, because the regex engine doesn't have to look any further than the left edge of the string for a match.
This is something that you could easily test and measure, of course. Do a regex match 10 million times or so, measure how long it takes, then try again with a different regex.
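Something along these lines would do it. This is only a sketch: the sample line is made up and deliberately does not match, since in log parsing most lines fail most patterns, and the numbers will depend on your own data and Perl version.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical log line that matches neither pattern.
my $miss = "lkdjflzdjfljdsflkjasdjf asldkfj lasdjf dslfj sldfj asld alskdfj lasd f";

my %tests = (
    # unanchored short pattern from the question
    float => sub {
        die "unexpected match" if $miss =~ /Some String Pattern/;
    },
    # anchored long pattern from the question
    anchored => sub {
        die "unexpected match"
            if $miss =~ /^Initial static string that matches Some String Pattern/;
    },
);

# A negative count tells Benchmark to run each test for at least
# that many CPU seconds instead of a fixed number of iterations.
cmpthese(-1, \%tests);
```

cmpthese prints a small table of rates and relative percentages, which is easier to read than raw timings when comparing two or three variants.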
Upvotes: 6