solotherm
solotherm

Reputation: 95

Search every occurence of a substring in a string C#

I know that similar questions and problems were already discussed here, but not exactly the ones I need.

Let's face the problem, that we have a long string, a really long one, the haystack, and we have a needle. Classic problem: search the occurences, all of it. Well Regex.Matches is a nice function, it only have one problem, which in our case is a serious problem, it doesn't count overlapped strings.

For example

string haystack = "(anything)....bbb....(anything)";  
string needle = "bb"; //must be a string not a char

Now we see, that bb is 2x in that bbb part of the haystack. (BBb and bBB) Regex.Matches only finds the first one, and he searches further from the third b. And will not find any other.

Any idea how to search fur substrings with overlapping option in a long haystack string (must be efficient of course)?

Thank you, and sorry if duplicate.

Upvotes: 2

Views: 177

Answers (4)

Matt Burland
Matt Burland

Reputation: 45155

The easier solution is to use string.IndexOf in a loop as others have suggested, but...if you really want to use Regex, you can use a positive lookahead to deal with overlaps. For example:

string test = "(anything)....bbb....(anything)";
Regex r = new Regex("b(?=b)");
foreach (Match m in r.Matches(test)) 
{
    Console.WriteLine(string.Format("match at: {0}",m.Index));
}

Will match at both indexes 14 and 15.

This is manageable for a 2 character match, but might become a pain with longer matches.

Determining whether or not this would be faster in practice than the string.IndexOf in a loop solution is left as an exercise for the reader.

Edit: Because curiosity got the better of me, I had to try and figure out which one is actually faster, here's some test code:

http://dotnetfiddle.net/qneK5u

        string test = "(anything)....bbbbbb....(anything)";
        Regex r = new Regex("b(?=b)");
        var s = new System.Diagnostics.Stopwatch();
        s.Start();
        int i=0; 
        int l=5000;
        for (;i<l;i++) {
            var result = r.Matches(test);
        }
        Console.WriteLine(s.ElapsedTicks.ToString());
        i = 0;
        string needle = "bb"; 
        int pos = 0;
        s.Reset();
        s.Start();
        for (;i<l;i++) {
            pos = 0;
            while((pos = test.IndexOf(needle, pos)) != -1)
            {
                pos++;
            }
        }
        Console.WriteLine(s.ElapsedTicks.ToString());

The regex solution come out significantly faster, but I'm not sure if I'm doing a fair test.

Upvotes: 1

norisknofun
norisknofun

Reputation: 879

This function returns the list of indexes of the starting matching needle string (may be faster than regex):

    public static List<int>  GetPositions(string haystack , string needle)
    {
        List<int> ret = new List<int>();
        if (string.IsNullOrEmpty(haystack) || string.IsNullOrEmpty(needle))
            return ret;
        for(int i=0;i<haystack.Length-needle.Length+1;i++)
        {
            int j = 0;
            for(;j<needle.Length && haystack[i+j]==needle[j];j++);
            if(j==needle.Length )
                ret.Add(i);
        }
        return ret;
    }

Upvotes: 0

Steve
Steve

Reputation: 216333

Using string.IndexOf in a loop

string haystack = "(anything)....bbb....(anything)";  
string needle = "bb"; 
int pos = 0;
while((pos = haystack.IndexOf(needle, pos)) != -1)
{
    Console.Write("Found at pos:" + pos.ToString());
    Console.WriteLine(" - " + haystack.Substring(pos, needle.Length));
    pos++;
}

Upvotes: 3

SJP
SJP

Reputation: 1372

Why not just maintain a pointer of where the substring is found then resume your Regex.Matches from that position + 1 whenever a match is found.

Upvotes: 1

Related Questions