Gev
Gev

Reputation: 439

A long Regex for a large string

I have some process, that is run over some files that contains a lot of strange data. The process needs to find some string and replace it with something else. Here is the function:

 private static string ReplaceRegex(string inputText, string regex, string replacement)
        {
            return (replacement != null)?new Regex(regex, RegexOptions.IgnoreCase).Replace(inputText, replacement).Trim(): string.Empty;
        }

It works properly in most cases, but once I passed to this function an inputText with 3491 chars length, and this string as a regex:

"\[HYPERLINK\]\s*(?:\<[\s\S]*?\>)*\s*([\s\S]*?)\s*(?:\<[\s\S]*?\>)*\s*\[\/HYPERLINK]\s*(?:\<NO1\>)?\s*(?:\<WC1?\>)?\s*\[URL\]\s*(?:\<NO1?\>)?\s*(?:\<WC1?\>)?\s*([\s\S]*?)\s*(?:\<NO1?\>)?\s*(?:\<WC1?\>)?\s*\[\/URL\](?:\<NO1?\>)?(?:\<WC1?\>)?"

The process stucks.

I was waiting that system would throw OutOfMemory exception, but it doesn't, it just stucks. I was waiting for it to response for hours, but it didn't respond.

Any ideas how I can solve this?

EDIT: Thank guys.

Just to be honest, I inherited this code with the project and now trying to figure out what's going on. And I do not know why somebody have done it this way.

Upvotes: 1

Views: 7143

Answers (6)

ΩmegaMan
ΩmegaMan

Reputation: 31636

Quit using * (zero or more) and give the parser a better processing hint by using + (one or more) when you know there will be data there. Only use * under an or type situation, where you don't want it to fail and allow nothing to be included.

Upvotes: 0

user557597
user557597

Reputation:

[\s\S]*? is turning into a greedy monster:

 \[HYPERLINK\]

 ( [\s\S]*? )                 # This turns into a greedy monster

 \[\/HYPERLINK\]              # as soon as one of   <- this

 \s* 
 (?: \<NO1\> )?
 \s* 
 (?: \<WC1?\> )?
 \s* 

 \[URL\]                      # or  <- this


 ( [\s\S]*? )                      #  This turns into a greedy monster of a greedy monster


 \[\/URL\]                   # or   <- this   are missing

Edit: you may be able to get around this by something like below, but if this is too restrictive, you will need at least some mid-expression anchors to break it up.

 # \[HYPERLINK\]\s*(?:\<[^>]*\>)*\s*((?:(?!\[\/HYPERLINK\]|\<[^>]*\>)[\S\s])*)\s*(?:\<[^>]*\>)*\s*\[\/HYPERLINK\]\s*(?:\<NO1\>)?\s*(?:\<WC1?\>)?\s*\[URL\]\s*(?:\<NO1?\>)?\s*(?:\<WC1?\>)?\s*((?:(?!\[\/URL\]|\<[^>]*\>)[\S\s])*)\s*(?:\<NO1?\>)?\s*(?:\<WC1?\>)?\s*\[\/URL\](?:\<NO1?\>)?(?:\<WC1?\>)?

 \[ HYPERLINK \]

 \s* 
 (?: \< [^>]* \> )*
 \s* 
 (
      (?:
           (?! \[\/HYPERLINK \] | \< [^>]* \> )
           [\S\s] 
      )*
 )
 \s* 
 (?: \< [^>]* \> )*
 \s* 

 \[\/HYPERLINK\]

 \s* 
 (?: \<NO1\> )?
 \s* 
 (?: \<WC1?\> )?
 \s* 

 \[URL\]

 \s* 
 (?: \<NO1?\> )?
 \s* 
 (?: \<WC1?\> )?
 \s* 
 (
      (?:
           (?! \[\/URL\] | \< [^>]* \> )
           [\S\s] 
      )*
 )
 \s* 
 (?: \<NO1?\> )?
 \s* 
 (?: \<WC1?\> )?
 \s* 

 \[\/URL\]

 (?: \<NO1?\> )?
 (?: \<WC1?\> )?

Upvotes: 1

JDB
JDB

Reputation: 25835

You have what's called "catastrophic backtracking".

Basically, when you have a variable-length expression (*, +, etc.) followed by an "overlapping" (that is, both expression could match on the same set of characters) variable length expression, you can get into a tug of war between the two expressions. This usually only happens when the entire expression fails and .NET regex enginge attempts to shift the input text between the overlapping expressions, so often it's missed in testing.

Your expression has many sub-expressions which could cause this, but here's an example:

\s*([\s\S]*?)

The first part, \s*, can match zero or more whitespace characters. The second, [\s\S]*?, can also match zero or more whitespace characters (in addition to non-whitespace characters). This will cause catastrophic backtracking in the event that your input fails on the first try and there are multiple whitespace characters to match.

I wrote a bit about this issue here as well:
How can I recognize an evil regex?

Upvotes: 5

Nicholas Carey
Nicholas Carey

Reputation: 74277

The problem is almost certainly backtracking. Regular expressions are greedy. The general rule is to take the 'leftmost longest' match. Something like .*Foo.*Bar.* is greedy:

  • The first .* will consume the entire source text.
  • Then, since that is not followed by Foo, it will start backtracking, shortening the match until it has a Foo.
  • The next .* will again slurp up everthing from the current point in the source text to the end.
  • And again will fail, since Bar is not found.
  • So it backtracks again until a Bar is found. You should note at this point that if no Bar is found, backtracking continues further back, looking for another Foo.

    You can imagine the sort of combinatorial explosion created by a convoluted regular expression with a lot of backtracking.

  • The final .* will consume everything from that point to the end of the string.

So...

Get Jeffrey Fried's opus, Mastering Regular Expressions

Master Regular Expressions Cover

It will help you out immensely.

Upvotes: 3

Reacher Gilt
Reacher Gilt

Reputation: 1813

I'm interested in the repeating pattern you have of (?:\<[\s\S]?>) and similar. [\s\S]? matches "one whitespace or non-whitespace character". I think this is functionally equivalent to a regex .

You also have things like \s*([\s\S]*?)\s* which is "zero or more whitespaces, followed by zero or more (whitespaces or non-whitespaces), followed by zero or more whitespaces". This is technically legal, but weird in a practical sense.

Even though "it works", I would advise you to either rewrite your regex in a way that's maintainable to you, or to find a way to parse your strings without the use of regular expressions.

Upvotes: 0

ben_re
ben_re

Reputation: 528

It could be something to do with the fact that you're using a lot of *'s. It's very easy to create a regex that can cripple your system by eating all your resources, especially when creating such a large one.

Personally, I would try adding some limits in there (like .{1,100}).

Upvotes: 0

Related Questions