Reputation: 810
I found this in some code I wanted to optimize.
Here is the snipet:
tempString = bigBuffer.replaceAll("\\n", "");
tempString = tempString.replaceAll("\\t", "");
Then I decided to use the regex wisely and I did this:
tempString = bigBuffer.replaceAll("[\\n\\t]", "");
Then a friend told me to do this instead:
tempString = bigBuffer.replaceAll("\\n|\\t", "");
Since I like to know the result of my changes I did a test to verify if it was a good optimization. So, the result with (java version "1.6.0_27") is with the first code being the reference 100%.
With the pipe it is 121% so it took more time to perform the task.
With the square bracket it is 52% so it took less time to perform the task.
Why does the regex behave differently where it should be the same?
Martin
Upvotes: 5
Views: 124
Reputation: 75222
Generally speaking, a character class ([abc]
) tends to be more efficient than an equivalent alternation (a|b|c
), so I don't know why your friend would suggest that. But in Java, character classes that match only characters from the Latin1 repertoire (i.e. the first 256 Unicode code points) are further optimized. That's probably why you're seeing such a big difference between the second and third techniques.
Again, that's just in Java. In Perl, I would expect the difference between the alternation and the character class to be negligible, it being a much more mature implementation. And in grep it would probably be difficult to measure the difference no matter which of the three approaches you used--it's just that fast.
But as a rule of thumb, if you have a choice between using a character class or an alternation, you should prefer the character class. It may not be faster, but it definitely won't be slower. And improperly used, alternation can have a disastrous effect on performance.
Upvotes: 3
Reputation: 715
The first code snippet looks through bigBuffer twice, the first time replacing the new lines, and the second time replaces the tabs.
The second code snippet would search through bigBuffer only once, checking to see if each character is one or the other. This would result in the speed finishing in only half the time.
The code snippet in the third place is probably poorly compiled, and results in a particularly bad version of the first code's algorithm, though I could not say for sure without examining the path through the regex compilation carefully.
Excellent work on the testing though. Relative timing (percent-based) is useful, absolute timing (millisecond or some such) is not.
Upvotes: 4