Reputation: 95
I know that similar questions and problems were already discussed here, but not exactly the ones I need.
Let's face the problem, that we have a long string, a really long one, the haystack, and we have a needle. Classic problem: search the occurences, all of it. Well Regex.Matches
is a nice function, it only have one problem, which in our case is a serious problem, it doesn't count overlapped strings.
For example
string haystack = "(anything)....bbb....(anything)";
string needle = "bb"; //must be a string not a char
Now we see, that bb
is 2x in that bbb
part of the haystack. (BBb
and bBB
) Regex.Matches
only finds the first one, and he searches further from the third b
. And will not find any other.
Any idea how to search fur substrings with overlapping option in a long haystack string (must be efficient of course)?
Thank you, and sorry if duplicate.
Upvotes: 2
Views: 177
Reputation: 45155
The easier solution is to use string.IndexOf
in a loop as others have suggested, but...if you really want to use Regex, you can use a positive lookahead to deal with overlaps. For example:
string test = "(anything)....bbb....(anything)";
Regex r = new Regex("b(?=b)");
foreach (Match m in r.Matches(test))
{
Console.WriteLine(string.Format("match at: {0}",m.Index));
}
Will match at both indexes 14
and 15
.
This is manageable for a 2 character match, but might become a pain with longer matches.
Determining whether or not this would be faster in practice than the string.IndexOf
in a loop solution is left as an exercise for the reader.
Edit: Because curiosity got the better of me, I had to try and figure out which one is actually faster, here's some test code:
http://dotnetfiddle.net/qneK5u
string test = "(anything)....bbbbbb....(anything)";
Regex r = new Regex("b(?=b)");
var s = new System.Diagnostics.Stopwatch();
s.Start();
int i=0;
int l=5000;
for (;i<l;i++) {
var result = r.Matches(test);
}
Console.WriteLine(s.ElapsedTicks.ToString());
i = 0;
string needle = "bb";
int pos = 0;
s.Reset();
s.Start();
for (;i<l;i++) {
pos = 0;
while((pos = test.IndexOf(needle, pos)) != -1)
{
pos++;
}
}
Console.WriteLine(s.ElapsedTicks.ToString());
The regex solution come out significantly faster, but I'm not sure if I'm doing a fair test.
Upvotes: 1
Reputation: 879
This function returns the list of indexes of the starting matching needle string (may be faster than regex):
public static List<int> GetPositions(string haystack , string needle)
{
List<int> ret = new List<int>();
if (string.IsNullOrEmpty(haystack) || string.IsNullOrEmpty(needle))
return ret;
for(int i=0;i<haystack.Length-needle.Length+1;i++)
{
int j = 0;
for(;j<needle.Length && haystack[i+j]==needle[j];j++);
if(j==needle.Length )
ret.Add(i);
}
return ret;
}
Upvotes: 0
Reputation: 216333
Using string.IndexOf in a loop
string haystack = "(anything)....bbb....(anything)";
string needle = "bb";
int pos = 0;
while((pos = haystack.IndexOf(needle, pos)) != -1)
{
Console.Write("Found at pos:" + pos.ToString());
Console.WriteLine(" - " + haystack.Substring(pos, needle.Length));
pos++;
}
Upvotes: 3
Reputation: 1372
Why not just maintain a pointer of where the substring is found then resume your Regex.Matches from that position + 1 whenever a match is found.
Upvotes: 1