RegEx Performance Issue

Question

We are having a problem with the following regular expression:

(.*?)\|\*\|([0-9]+)\*\|\*(.*?)

It should match things like: |*25 *|

We are using the .NET Framework 4 Regex class. The code is the following:

string expression = "(.*?)" + 
       Regex.Escape(Constants.FIELD_START_DELIMITER_BACK_END) + 
       "([0-9]+)" + 
       Regex.Escape(Constants.FIELD_END_DELIMITER_BACK_END) + 
       "(.*?)";
Regex r = new Regex(expression);
MatchCollection matches = r.Matches(contentText);

It takes too long (about 60 seconds) on a 40,000-character text.

But on a 180,000-character text the speed is very acceptable (3 seconds or less).

The only difference between the texts is that the first one (the slow one) is all on a single line, with no line breaks. Could this be the issue? Is that what is affecting the performance?

Thanks

Alan Moore · Accepted Answer

@David Gorsline's solution (from the comment) is correct:

string expression =
    Regex.Escape(Constants.FIELD_START_DELIMITER_BACK_END) + 
    "([0-9]+)" + 
    Regex.Escape(Constants.FIELD_END_DELIMITER_BACK_END);

Specifically, it's the (.*?) at the beginning that's doing you in. It takes over the job the regex engine should be doing itself, scanning for the next place where the regex can match, and does that job much, much less efficiently. At each position, the (.*?) effectively performs a lookahead to determine whether the next part of the regex can match, and only if that fails does it go ahead and consume the next character.

But even if you used something more efficient, like [^|]*, you would still be slowing it down. Leave that part off, though, and the regex engine can instead scan for the first constant portion of the regex, probably using an algorithm like Boyer-Moore or Knuth-Morris-Pratt. So don't worry about what's around the bits you want to match; just tell the regex engine what you're looking for and get out of its way.
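As a rough illustration of "get out of its way", here is a minimal sketch of the trimmed pattern in use. The delimiter values "|*|" and "*|*" are assumptions inferred from the regex in the question (the real Constants values are not shown), and FieldScanner and FindFieldIds are hypothetical names. It is also likely why the single-line text suffers most: `.` does not match a newline by default in .NET, so on text with line breaks each failed scan by the leading (.*?) stops at the end of the line, whereas on one long line it can run to the end of the text from every starting position.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class FieldScanner
{
    // Assumed stand-ins for Constants.FIELD_START_DELIMITER_BACK_END and
    // Constants.FIELD_END_DELIMITER_BACK_END; the real values are not shown
    // in the question.
    private const string StartDelimiter = "|*|";
    private const string EndDelimiter = "*|*";

    // Only the delimiters and the digits; no (.*?) on either side.
    private static readonly Regex FieldPattern = new Regex(
        Regex.Escape(StartDelimiter) + "([0-9]+)" + Regex.Escape(EndDelimiter));

    // Returns the digit strings found between the delimiters, in order.
    public static List<string> FindFieldIds(string contentText)
    {
        var ids = new List<string>();
        foreach (Match m in FieldPattern.Matches(contentText))
        {
            ids.Add(m.Groups[1].Value);
        }
        return ids;
    }

    public static void Main()
    {
        string text = "Dear |*|25*|*, your order |*|7*|* has shipped.";
        foreach (string id in FindFieldIds(text))
        {
            Console.WriteLine(id); // prints 25, then 7
        }
    }
}
```

If the text between fields is also needed, it can be recovered from each Match's Index and Length properties rather than captured by the regex.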

On the other hand, the trailing (.*?) has virtually no effect, because it never really does anything. The ? turns the .* reluctant, so what does it take to make it go ahead and consume the next character? It will only do so if there's something following it in the regex that forces it to. For example, foo.*?bar consumes everything from the next "foo" to the next "bar" after that, but foo.*? stops as soon as it's consumed "foo". It never makes sense to have a reluctant quantifier as the last thing in a regex.
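The foo/bar behaviour is easy to check directly. The following is a small sketch; ReluctantDemo and FirstMatch are hypothetical names, and the input string is made up for the demonstration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ReluctantDemo
{
    // Returns the text of the first match of pattern in input, or null if none.
    public static string FirstMatch(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Value : null;
    }

    public static void Main()
    {
        string input = "xx foo 123 bar yy";

        // Something follows the reluctant quantifier, so it is forced to expand.
        Console.WriteLine("[" + FirstMatch("foo.*?bar", input) + "]"); // [foo 123 bar]

        // Nothing follows it, so it matches the empty string and stops.
        Console.WriteLine("[" + FirstMatch("foo.*?", input) + "]");    // [foo]
    }
}
```

The same applies to the trailing (.*?) in the question: its capture group is always empty, so dropping it loses nothing.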
