Mr. Llama
Mr. Llama

Reputation: 20899

Perl performance hit between two common regex methods for string trimming?

So I'm working on a Perl script that does a large amount of processing (nothing too complicated, but lots of it) and decided to do a little benchmark to compare two common methods of trimming strings.

The first method is a quick one-liner:

$word =~ s/^\s+|\s+$//g;

The second method is a little longer, but does the same thing:

$word =~ s/^\s+//;
$word =~ s/\s+$//;

For my benchmarks, I had the script read from a file with 40 million lines, trimming each (does nothing other than that). The average line length is under 20 bytes.

The first method took on average 87 seconds to complete.
The second method took on average 27 seconds to complete.
Doing no processing (just read line, continue) takes an average 16 seconds.

The first method (first pass) will match either all the leading or trailing whitespace, then remove it, then match and remove the leading/trailing whitespace on the other side.
The second method matches and removes all leading whitespace, then matches and removes all trailing whitespace.

Maybe I'm in the wrong here, but why would the second method be over 3x faster than the first?

Upvotes: 4

Views: 321

Answers (3)

JRFerguson
JRFerguson

Reputation: 7516

The regex engine is having to do more work in the first case namely in backtracking to evaluate alternatives. You can see the difference in the code involved:

echo " hello " |perl -Mre=debug -ple 's/^\s+|\s+$//g'
echo " hello " |perl -Mre=debug -ple 's/^\s+//;s/\s+$//'

Upvotes: 11

Mark Byers
Mark Byers

Reputation: 838226

I suspect that the Perl regular expression may be able to optimize the second version by using a static analysis of the pattern. For example it might see that /^foo/ must match at the start of the string. If the match fails there, there is no point in iterating over the rest of the characters in the string checking for matches.

By default, the "^" character is guaranteed to match only the beginning of the string, the "$" character only the end (or before the newline at the end), and Perl does certain optimizations with the assumption that the string contains only one line.

Source (Emphasis mine.)

The first version is a more complicated expression and is not so easily optimized.

Upvotes: 2

sehe
sehe

Reputation: 393064

It makes a sense that anchored non-backtracking patterns can be optimized WAY better (effectively a single forward/backward sequential scan starting from a known character position);

Chances are that the 'option' (|) makes the optimizer back off and you get standard backtracking, quite badly so, because many spaces might occur that are not trailing

Upvotes: 5

Related Questions