Garrett Carson
Garrett Carson

Reputation: 55

Regex catastrophic backtracking only in .net

I have a regex that runs fine on http://gskinner.com/RegExr/ and http://regexhero.net/tester/ which is .net. However it timeouts (1h+) in my .net v4.5 code.

(?<OuterDescription>[ \t]*--[ \t]+Description:[ \t]+(?!\<Description)(?<Description>\S[^\r\n]{1,})((\r\n|\r|\n)(?![ \t]*--[ \t]*Modified)[^\r\n]*)*)

with sample data:

-- ========================================================================================================
-- Author:        A Name
-- Create date: 11/26/2012
-- Description:    A description

    --    A multiline description 
        -------------------------------------- Group Name -----------------------------------------
        -- More details
        -- More details
--
--  Modified: 01/7/2012 - Some reason
--  Modified: 12/7/2012 - Some other reason
-- ========================================================================================================

my code looks like this

var isMatch = new Regex(pattern, RegexOptions.None, TimeSpan.FromMinutes(1)).IsMatch(_fileText);

hoping OuterDescription captures from -- Description to just before -- Modified

I've narrowed it down to the [^\r\n]* near the end. I do not know how to fix this to not timeout in c#

Edit:

Thanks for the discussion and answer. It helped move the timeout out of the description. Unfortunately I am still having problems. This is what I have so far

[ \t]*--[ \t]+={3,}
(\r\n|\n|\r)
(?<OuterAuthor>[ \t]*--[ \t]+
    Author:[ \t]+
    (?!\<Author)
    (?<Author>\S[^\r\n]+))
(\r\n|\n|\r)
(?<OuterCreateDate>[ \t]*--[ \t]+
    Create\ [Dd]ate:[ \t]+
    (?!\<Create)
    (?<CreateDate>\S[^\r\n]{1,}))
(\r\n|\n|\r)
(?<OuterDescription>[ \t]*--[ \t]+
    Description:[ \t]+
    (?!\<Description)
    (?<Description>\S[^\r\n]+)
    (?<MultilineDescription>((\r\n|\r|\n)|[^\r\n]*)*?)
    (?=(
        [ \t]*--[ \t]*Modified)|(
        [ \t]*--[ \t]*={3,})
    ))

That works fine but as soon as I add something after this it will timeout.

Sorry I did not mention this first, I assumed one little greedy star change would be all my problems. To get a sense of the final picture, I have an isAdded bool that will determine whether I check for modified lines (the same way as description) then end with the header/footer. like so

var entireCommentHeaderNamedGroupsRegex = headerFooterRegex + newlineRegex
                                          + authorRegex + newlineRegex
                                          + createDateRegex + newlineRegex
                                          + descriptionRegex + newlineRegex
                                          + (_isAdded ? modifiedRegex + newlineRegex : "")
                                          + headerFooterRegex;

some more sample data for when it is not modified:

-- =============================================
-- Author:      Garrett Carson
-- Create date: 10/4/2013
-- Description: This is a test
-- =============================================
CREATE PROCEDURE dbo.ThisIsATest
AS
BEGIN
    PRINT 'This is a test'
END

Also, as mentioned in the comments, I am fairly new to regular expressions (on this scale) so excuse my terminology if this is not actually catastrophic backtracking.

Edit 2

As a final edit, I ended up going with a poor man's fsm

string currentState = "LookForAuthor"
foreach (var line in lines) {
    switch currentState {
        case "LookForAuthor" : {
            ... use author regex ... save to author variable ...
            if(found) currentState = "LookForCreateDate"
            else throw new InvalidCommentException();
        }
        case "LookForCreateDate": {
            ... use createDate regex ... save to createDate variable ...
            ...
        }
        ...
    }
}
if (!_isAdded && !(currentState == "Modified-FirstLine" || currentState == "Modified-MoreLines")) {
    throw new InvalidCommentException();
}

I then re factored to using enums. The bite sized regexes applied line by line no longer cause timeouts.

Upvotes: 4

Views: 536

Answers (1)

staafl
staafl

Reputation: 3225

The following seems to work for me (using RegexOptions.IgnorePatternWhitespace for clarity):

@"(?<OuterDescription>[ \t]*--[ \t]+
    Description:[ \t]+
    (?!\<Description)
    (?<Description> \S[^\r\n]{1,})
    (?<MultilineDescription>(\r?\n|[^\r\n]*)*?)
    (?=[ \t]*--[ \t]*Modified)
)";

In general, nesting greedy quantifiers can lead to the problem you're experiencing. Unfortunately I'm too tired to investigate it in depth, but if you're curious about what's going wrong, I can make a note to look into it later

Upvotes: 2

Related Questions