Steve
Steve

Reputation: 12004

performance issues with finding nth occurence of a character with a regular expression

I have a regex to find the nth occurrence of a character in a string, here's the code:

 public static int NthIndexOf(this string target, string value, int n)
    {
        Match m = Regex.Match(target, "((" + value + ").*?){" + n + "}");

        if (m.Success)
        {
            return m.Groups[2].Captures[n - 1].Index;
        }
        else
        {
             return -1;
        }
    }

Now, I have 1594 entries in this string, with 1593 semicolons. If I write:

tempstring.NthIndexOf(";", 1593) 

The answer comes back immediately and correctly. If I give it anything over 1594 it hangs. Does anyone know how to fix this?

Test Case

 string holder = "test;test2;test3";
        string test = "";
        for (int i = 0; i < 600; i++)
        {
            test += holder;
        }
        int index = test.NthIndexOf(";", 2000);

This takes a very long time. Change 600 to 6 and it is very fast. Make 2000 to 1700 and it is very fast as well.

Why is my regular expression so slow?

Upvotes: 0

Views: 1469

Answers (2)

Daniel LeCheminant
Daniel LeCheminant

Reputation: 51081

If you're really only looking for character repetitions, and not string repetitions, then you should be able to replace you method with something simple like

public static int NthIndexOf(this string target, char testChar, int n)
{
   int count = 0;

   for(int i=0; i<target.Length; i++)
   {
      if(target[i] == testChar)
      {
         count++;
         if(count == n) return i;  
      }
   }

   return -1;
}

and use that. It should have far fewer limitations.

As for why your original regex is going slow, here's what I suspect:

  • For your fast case, it's working because it can find a match on it's first pass through (with each group matching exactly one character)
  • For the slow case is because it can't find a match (and won't ever find one, because there aren't enough semicolons to satisfy the regex), but it recursively tries every possible way to break up the string (which is a really big operation)

Upvotes: 9

Gumbo
Gumbo

Reputation: 655379

Try to use a more distinct and efficient regular expression:

"^(?:[^" + value + "]*" + value + "){" + (n - 1) + "}([^" + value + "]*)

This will build the following regular expression for tempstring.NthIndexOf(";", 1593):

^(?:[^;]*;){1592}([^;]*)

But this will only work for single characters as separator.

Another approach would be to step through each character and count the occurences of the character you were looking for.

Upvotes: 2

Related Questions