Hamid
Hamid

Reputation: 1099

How to prevent Regular Expression of hang (or set time out for it) in .Net

I am using regular expression to a remove comment tag in html file (Pattern is: "<!--(.|\s)*?--!?>")

But some sites are not using a standard html tag, sample:

<script language="javascript">
    <!-- 
     js code ...
    </script>

In this case, my regular expression will hang, and also try-catch does not catch error. How would I remedy this problem?

Upvotes: 2

Views: 5796

Answers (2)

Jan Goyvaerts
Jan Goyvaerts

Reputation: 21999

The problem with the performance of your regex is trivial. Don't do this:

(.|\s)*

Whether the quantifier is lazy or greedy is entirely besides the point. The problem is that . and \s are not mutually exclusive. Spaces can be matched by both . and \s. Thus if your regex encounters a space, it will first match the space with ., and if the remainder of the regex fails, it will match it again with \s. If you have two spaces, it will first match both with ., then the first with . and the second with \s, then the frist with \s and the second with ., and then both with \s. As you can see, your regex has a complexity of O(2^N) when it encounters a run of spaces followed by something the remainder of the regex cannot match. If you have 10 spaces, there are 1024 permutations. If you have 32 spaces, there are 4 billion permutations.

The reason you only see the problem when your regex fails is that when the regex succeeds, the . simply matches all the spaces, and \s never gets any of the action.

I know what you're trying to do: you want to match a run of "any" character, including line breaks, which are normally not matched by the dot. The proper solution is to set RegexOptions.SingleLine and use this regex instead:

.*

If you can't set RegexOptions.SingleLine, use this mode modifier to do the same:

(?s).*

If you can't use that mode modifier, e.g. because JavaScript doesn't support it, use a character class with two complementary shorthands:

[\S\s]*

Once you get that horrid (.|\s) alternation out of your regex it will work perfectly. There's no need to use any of the complicated regexes the others suggested. A single lazy quantifier always expands linearly. Alternation that is not mutually exclusive always kills your regex. I indeed call this catastrophic backtracking.

And if you want a regex that allows a tag to terminate a comment, try this:

(?s)<!--.*?(-->|</script>)

Upvotes: 12

Alan Moore
Alan Moore

Reputation: 75222

You can rewrite the regex so that it fails as quickly as possible when no match is possible, like so:

<!--(?>(?:[^-]+|-(?!->))*)-->

If the unclosed comment in your example is followed by a complete comment further along, this regex will match from the the first <!-- to the first -->, like this:

<!-- blah <!-- blah -->

That's how your browser is supposed to treat SGML comments. In fact, if there's no matching -->, everything after the <!-- is commented out. So the regex should really be:

<!--(?>(?:[^-]+|-(?!->))*)(?:-->|\z)

But I suspect this isn't really what you want. For a better answer, we'll need to know what you want to do about malformed HTML like the snippet you posted.

Upvotes: 2

Related Questions