user1040478
user1040478

Reputation: 97

RegEx Performance Issue

We are having problem with the following regular expression:

(.*?)\|\*\|([0-9]+)\*\|\*(.*?)

It should match things like: |*25 *|

We are using .Net Framework 4 RegEx Class the code is the following:

string expression = "(.*?)" + 
       Regex.Escape(Constants.FIELD_START_DELIMITER_BACK_END) + 
       "([0-9]+)" + 
       Regex.Escape(Constants.FIELD_END_DELIMITER_BACK_END) + 
       "(.*?)";
Regex r = new Regex(expression);
r.Matches(contentText)

It is taking too long (like 60 seconds) with a 40.000 character text.

But with a text of 180.000 the speed its very acceptable (3 sec or less)

The only difference between texts its that the first text(the one which is slow) it is all contained in a single line, with no line breaks. Can this be an issue? That is affecting the performance?

Thanks

Upvotes: 2

Views: 279

Answers (2)

Alan Moore
Alan Moore

Reputation: 75222

@David Gorsline's solution (from the comment) is correct:

string expression =
    Regex.Escape(Constants.FIELD_START_DELIMITER_BACK_END) + 
    "([0-9]+)" + 
    Regex.Escape(Constants.FIELD_END_DELIMITER_BACK_END);

Specifically, it's the (.*?) at the beginning that's doing you in. What that does is take over doing what the regex engine should be doing itself--scan for the next place where the regex can match--and doing it much, much less efficiently. At each position, the (.*?) effectively performs a lookahead to determine whether the next part of the regex can match, and only if that fails does it go ahead and consume the next character.

But even if you used something more efficient, like [^|]*, you would still be slowing it down. Leave that part off, though, and the regex engine can instead scan for the first constant portion of the regex, probably using an algorithm like Boyer-Moore or Knuth-Morris-Pratt. So don't worry about what's around the bits you want to match; just tell the regex engine what you're looking for and get out of its way.

On the other hand, the trailing (.*?) has virtually no effect, because it never really does anything. The ? turns the .* reluctant, so what does it take to make it go ahead and consume the next character? It will only do so if there's something following it in the regex that forces it to. For example, foo.*?bar consumes everything from the next "foo" to the next "bar" after that, but foo.*? stops as soon as it's consumed "foo". It never makes sense to have a reluctant quantifier as the last thing in a regex.

Upvotes: 5

Kobi
Kobi

Reputation: 138007

You've answered your question: the problem is that . fails to match new-lines (it doesn't by default), which results in many failed attempts - almost one for every position on your 40000 character string.
On the long but single lined file, the engine can match the pattern in a single pass over the file (assuming a successful match exists - if it doesn't, I suspect it will take a long time to fail...).
On the shorter file, with many lines, the engine tries to match from the first character. It matches .*? until the end of the first line (this is a lazy match, so a lot more is happening, but lets ignore that), and fails. Now, it stats again from the second character, not the second line! This results in n² complexity even before matching the number.

A simple solution is to make . match newlines:

Regex r = new Regex(expression, RegexOptions.Singleline);

You can also make sure to match from start to end using the absolute start and end anchors, \A and \z:

string expression = "\\A(.*?)" + 
   Regex.Escape(Constants.FIELD_START_DELIMITER_BACK_END) + 
   "([0-9]+)" + 
   Regex.Escape(Constants.FIELD_END_DELIMITER_BACK_END) + 
   "(.*?)\\z";

Another note: As David suggests in the comments, \|\*\|([0-9]+)\*\|\* should work well enough. Even if you need to "capture" all text before and after the match, you can easily get it using the position of the match.

Upvotes: 2

Related Questions