Chris Peterson
Chris Peterson

Reputation: 713

C# dotnetcore Regex Hangs When Given Long, Unbroken String

I have a regex that is hanging when it tries to match against a long, unbroken string. Here is a sample console app:

using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

public class Example
{
    public static void Main()
    {
        Stopwatch sw;
        string pattern = @"(?:(?:https?|ftps?):\/\/)?(?:\S+(?::\S*)?@)?(?:(?!10(?:\.\d{1,3}){3})(?!127(?:\.\d{1,3}){3})(?!169\.254(?:\.\d{1,3}){2})(?!192\.168(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))(?::\d{2,5})?(?:\/[^\s]*)?";
        string input = "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA";

        Console.WriteLine("Press any key to match regex.");
        Console.ReadKey();
        Console.WriteLine("Starting regex...");

        sw = Stopwatch.StartNew();
        Match m = Regex.Match(input, pattern, RegexOptions.IgnoreCase | RegexOptions.Multiline);
        sw.Stop();


        Console.WriteLine($"Regex completed in {sw.Elapsed}. Press any key to exit.");
        Console.ReadKey();
    }
}

The Regex is meant to find URLs in user-generated comments. When a normal comment is provided, it processes in no time at all. A 100-word lorem ipsum it processes in about 36ms. As soon as a long, unbroken string is introduced, the regex hangs and, as far as I can tell, never finishes processing. The string does not need to have the same characters repeated.

Any help or insight would be appreciated.

Upvotes: 1

Views: 157

Answers (1)

Wiktor Stribiżew
Wiktor Stribiżew

Reputation: 626794

The main problem with your regex is that there are optional patterns alongside obligatory one pattern inside a * / + quantified group, see (?:[a-z\u00a1-\uffff0-9]+-?)*. This may lead (with long non-matching strings) to the the behavior when the regex engine starts trying all possible routes to match a string, and there may appear too many of them, so many that the system appears to freeze: a catastrophic backtracking.

Thus, if you plan to use your simplified solution you should avoid the similar patterns, use

(?:(?:https?|ftp)://)(?:-\.)?[^\s/?.#-]+(?:\.[^\s/?.#-]+)*(?:/\S*)?

where (?:[^\s/?\.#-]+\.?)+ is unrolled as [^\s/?.#-]+(?:\.[^\s/?.#-]+)*. While it is longer, the engine fails much quicker than when the optional pattern is inside the quantified group.

If you want to fix the original pattern, use

string pattern = @"(?:(?:http|ftp)s?://)?(?:\S+(?::\S*)?@)?(?:(?!1(?:0|27)(?:\.\d{1,3}){3})(?!1(?:69\.254|92\.168|72\.(?:1[6-9]|2\d|3[0-1]))(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:[a-z\u00a1-\uffff0-9]+(?:-[a-z\u00a1-\uffff0-9]+)*)(?:\.[a-z\u00a1-\uffff0-9]+(?:-[a-z\u00a1-\uffff0-9]+)*)*(?:\.[a-z\u00a1-\uffff]{2,}))(?::\d{2,5})?(?:/\S*)?";

Check how this regex matches and how (?:[a-z\u00a1-\uffff0-9]+-?)* is unrolled as [a-z\u00a1-\uffff0-9]+(?:-[a-z\u00a1-\uffff0-9]+)* to match patterns so that each subsequent pattern could not match the same chars. I also merged some negative lookaheads with common "suffixes". Note that (?:\S+(?::\S*)?@)? is left untouched as it may need to match any :s up to the last : before the rest of matching patterns.

Upvotes: 3

Related Questions