Reputation: 713
I have a regex that is hanging when it tries to match against a long, unbroken string. Here is a sample console app:
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;
public class Example
{
public static void Main()
{
Stopwatch sw;
string pattern = @"(?:(?:https?|ftps?):\/\/)?(?:\S+(?::\S*)?@)?(?:(?!10(?:\.\d{1,3}){3})(?!127(?:\.\d{1,3}){3})(?!169\.254(?:\.\d{1,3}){2})(?!192\.168(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))(?::\d{2,5})?(?:\/[^\s]*)?";
string input = "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA";
Console.WriteLine("Press any key to match regex.");
Console.ReadKey();
Console.WriteLine("Starting regex...");
sw = Stopwatch.StartNew();
Match m = Regex.Match(input, pattern, RegexOptions.IgnoreCase | RegexOptions.Multiline);
sw.Stop();
Console.WriteLine($"Regex completed in {sw.Elapsed}. Press any key to exit.");
Console.ReadKey();
}
}
The Regex is meant to find URLs in user-generated comments. When a normal comment is provided, it processes in no time at all. A 100-word lorem ipsum it processes in about 36ms. As soon as a long, unbroken string is introduced, the regex hangs and, as far as I can tell, never finishes processing. The string does not need to have the same characters repeated.
Any help or insight would be appreciated.
Upvotes: 1
Views: 157
Reputation: 626794
The main problem with your regex is that there are optional patterns alongside obligatory one pattern inside a *
/ +
quantified group, see (?:[a-z\u00a1-\uffff0-9]+-?)*
. This may lead (with long non-matching strings) to the the behavior when the regex engine starts trying all possible routes to match a string, and there may appear too many of them, so many that the system appears to freeze: a catastrophic backtracking.
Thus, if you plan to use your simplified solution you should avoid the similar patterns, use
(?:(?:https?|ftp)://)(?:-\.)?[^\s/?.#-]+(?:\.[^\s/?.#-]+)*(?:/\S*)?
where (?:[^\s/?\.#-]+\.?)+
is unrolled as [^\s/?.#-]+(?:\.[^\s/?.#-]+)*
. While it is longer, the engine fails much quicker than when the optional pattern is inside the quantified group.
If you want to fix the original pattern, use
string pattern = @"(?:(?:http|ftp)s?://)?(?:\S+(?::\S*)?@)?(?:(?!1(?:0|27)(?:\.\d{1,3}){3})(?!1(?:69\.254|92\.168|72\.(?:1[6-9]|2\d|3[0-1]))(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:[a-z\u00a1-\uffff0-9]+(?:-[a-z\u00a1-\uffff0-9]+)*)(?:\.[a-z\u00a1-\uffff0-9]+(?:-[a-z\u00a1-\uffff0-9]+)*)*(?:\.[a-z\u00a1-\uffff]{2,}))(?::\d{2,5})?(?:/\S*)?";
Check how this regex matches and how (?:[a-z\u00a1-\uffff0-9]+-?)*
is unrolled as [a-z\u00a1-\uffff0-9]+(?:-[a-z\u00a1-\uffff0-9]+)*
to match patterns so that each subsequent pattern could not match the same chars. I also merged some negative lookaheads with common "suffixes". Note that (?:\S+(?::\S*)?@)?
is left untouched as it may need to match any :
s up to the last :
before the rest of matching patterns.
Upvotes: 3