Mike Sadler
Mike Sadler

Reputation: 1810

Why does Regex hang in C#

I have had a fairly good search around, and although there are quite a few similar questions, I don't believe the answers are applicable.

I have a reasonably inefficient regex searching a reasonably large string. I have tested it in http://regexpal.com with the exact regex and string, and it comes back with the correct answer almost instantaneously.

The C# Regex module with the same inputs hangs - or at least I've left it 10 minutes to do what regexpal can do in fractions of a second.

Is the C# implementation of Regex hopelessly less efficient than http://regexpal.com, or is it genuinely hanging? The regex is to search for two keywords which are separated by an unknown number of lines:

"KEYWORD1(.|\r|\n)+KEYWORD2\t +.+"

And the string is 830 lines long, each line being approximately 30 characters.

Upvotes: 2

Views: 1035

Answers (3)

nhahtdh
nhahtdh

Reputation: 56809

According to the documentation on Regular Expression, . matches any single character except \n. This means that . (which doesn't match \r in Java (default mode), JavaScript, etc.) matches \r in .NET.

Your regex effectively allows 2 branches for the same character \r. The more \r in the input, the longer it takes to run the regex. On a failing input, it will cause exponential complexity based on the number of \r in the input.

Note that regexpal is a JavaScript regular expression tester, and as mentioned above, . in JavaScript excludes \r, \n (and a few other line separator). Since there is no overlap in what they match, each character has at most 1 branch to follow.

One solution is to replace (.|\r|\n)+ with (?s:.+). The s flag will effectively makes . match any character without exception. There is only one branch for any character, so no exponential backtracking.

 +.+ can't cause much inefficiency in this case, since it is already at the end of the pattern. It may cause problem (quadratic complexity) if there is something else following it, though. For example, if there is $ at the end, then in the failing case, when the pattern  +.+$ is matches against a suffix with lots of spaces, followed by a newline at the end, then an unoptimized engine will try all ways to divide the consecutive spaces into 2 parts.

Upvotes: 5

Joshua Powers
Joshua Powers

Reputation: 72

Basically, your regular expression is recursive. Removing the final .+ solves the issue. Here is a good article on avoiding it in the future. Why does .NET hang? Probably because it will never abort the search. Perhaps the JavaScript parser you used aborts if it detects too many steps or goes too deep into recursive searches.

Upvotes: -1

John Bustos
John Bustos

Reputation: 19544

As was stated in the comments, a common issue with Regex is having a pattern along the lines of:

Word1.+Word2

Because if your text was very large and had something like:

Word1 ... Word1 ... Word2 ... Word2 ... Word1 .... Word2 ... Word2

You would have ALL combinations matched that began with Word1 and ended with Word2 - Even when Word1 or Word2 were in between them.

Generally that's not what you're looking for and want the shortest set of characters between your start and end points (or not have Word1 show up again). For that your Regex would best be changed to:

Word1.+?Word2

Hope that makes sense.

Upvotes: 2

Related Questions