Reputation: 7775
This simple piece of C# code, which is meant to find script blocks in HTML, takes 0.5 seconds to run on a 74K-character string with only 9 script blocks in it. This is a release binary run without a debugger, on a 2.8 GHz i7 CPU. I made several runs through this code to make sure that performance is not impeded by JIT compilation. It is not.
This is VS2010, .NET 4.0 Client Profile, x64.
Why is this so slow?
int[] _exclStart = new int[100];   // start offset of each script block
int[] _exclStop = new int[100];    // end offset of each script block
int _excl = 0;                     // number of blocks found so far
for (int f = input.IndexOf("<script", 0); f != -1; )
{
    _exclStart[_excl] = f;
    // find the matching closing tag
    f = input.IndexOf("</script", f + 8);
    if (f == -1)
    {
        _exclStop[_excl] = input.Length;
        break;
    }
    _exclStop[_excl] = f;
    // look for the next opening tag
    f = input.IndexOf("<script", f + 8);
    ++_excl;
}
Upvotes: 9
Views: 11594
Reputation: 5715
I just tested IndexOf performance with .NET 4.0 on Windows 7:
public void Test()
{
    var input = "Hello world, I'm ekk. This is test string";

    TestStringIndexOfPerformance(input, StringComparison.CurrentCulture);
    TestStringIndexOfPerformance(input, StringComparison.InvariantCulture);
    TestStringIndexOfPerformance(input, StringComparison.Ordinal);

    Console.ReadLine();
}

private static void TestStringIndexOfPerformance(string input, StringComparison stringComparison)
{
    var count = 0;
    var startTime = DateTime.UtcNow;
    TimeSpan result;

    for (var index = 0; index != 1000000; index++)
    {
        count = input.IndexOf("<script", 0, stringComparison);
    }

    result = DateTime.UtcNow.Subtract(startTime);

    Console.WriteLine("{0}: {1}", stringComparison, count);
    Console.WriteLine("Total time: {0}", result.TotalMilliseconds);
    Console.WriteLine("--------------------------------");
}
And the result is:
CurrentCulture:
Total time: 225.4008
InvariantCulture:
Total time: 187.2003
Ordinal:
Total time: 124.8003
As you can see, the performance of Ordinal is a little better.
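As a side note, here is a minimal sketch of the same measurement using System.Diagnostics.Stopwatch instead of subtracting DateTime.UtcNow values; Stopwatch has higher resolution and is generally preferred for short benchmarks like this (the loop and iteration count are unchanged):

private static void TestStringIndexOfPerformance(string input, StringComparison stringComparison)
{
    var count = 0;
    var sw = System.Diagnostics.Stopwatch.StartNew();   // high-resolution timer

    for (var index = 0; index != 1000000; index++)
    {
        count = input.IndexOf("<script", 0, stringComparison);
    }

    sw.Stop();

    Console.WriteLine("{0}: {1}", stringComparison, count);
    Console.WriteLine("Total time: {0} ms", sw.Elapsed.TotalMilliseconds);
    Console.WriteLine("--------------------------------");
}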
Upvotes: 2
Reputation: 33252
I won't discuss the code here; it could probably be rewritten with Regex and so on. But in my opinion it is slow because the IndexOf() *inside* the for loop
always rescans the string from the beginning (it always starts from index 0). Try scanning from the last occurrence found instead.
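For illustration, a minimal sketch of resuming from the last occurrence using the startIndex overload of IndexOf (assuming input is the HTML string from the question; the loop structure here is my own):

// Resume each search at the character just past the previous hit instead of index 0.
int pos = 0;
while ((pos = input.IndexOf("<script", pos, StringComparison.Ordinal)) != -1)
{
    Console.WriteLine("script tag found at {0}", pos);
    pos += "<script".Length;   // continue scanning after this occurrence
}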
Upvotes: 1
Reputation: 8693
I used the source of this page as an example; I then duplicated the content 8 times, resulting in a page some 334,312 bytes long. Using StringComparison.Ordinal yields a massive performance difference.
string newInput = string.Format("{0}{0}{0}{0}{0}{0}{0}{0}", input.Trim().ToLower());
//string newInput = input.Trim().ToLower();

System.Diagnostics.Stopwatch sw = new System.Diagnostics.Stopwatch();
sw.Start();

int[] _exclStart = new int[100];
int[] _exclStop = new int[100];
int _excl = 0;
for (int f = newInput.IndexOf("<script", 0, StringComparison.Ordinal); f != -1; )
{
    _exclStart[_excl] = f;
    f = newInput.IndexOf("</script", f + 8, StringComparison.Ordinal);
    if (f == -1)
    {
        _exclStop[_excl] = newInput.Length;
        break;
    }
    _exclStop[_excl] = f;
    f = newInput.IndexOf("<script", f + 8, StringComparison.Ordinal);
    ++_excl;
}

sw.Stop();
Console.WriteLine(sw.Elapsed.TotalMilliseconds);
Running it 5 times yields almost the same result each time (the loop timings did not change significantly, so for this simple code there is almost no time spent by the JIT compiling it).
Output using your original code (in Milliseconds):
10.2786
11.4671
11.1066
10.6537
10.0723
Output using the above code instead (in Milliseconds):
0.3055
0.2953
0.2972
0.3112
0.3347
Notice that my test results are around 0.010 seconds (original code) and 0.0003 seconds (ordinal code), which means you have something else wrong other than this code directly.
If, as you say, using StringComparison.Ordinal does nothing for your performance, then that means either you're using incorrect timers to time your performance, or you have a large overhead in reading your input value, such as reading it from a stream again, which you otherwise don't realise.
Tested under Windows 7 x64 running on a 3GHz i5 using .NET 4 Client Profile.
Suggestions:
- Use StringComparison.Ordinal
- Use System.Diagnostics.Stopwatch to time performance
- Copy input to a local variable instead of using values external to the function (eg: string newInput = input.Trim().ToLower();)

Again I stress, I am getting 50 times faster speed on test data that is apparently over 4 times larger in size, using the exact same code that you provide. That means my test is running some 200 times faster than yours, which is not something anyone would expect given we're both running the same environment, and just an i5 (me) versus an i7 (you).
Upvotes: 23
Reputation: 284806
The IndexOf overload you're using is culture-sensitive, which will affect performance. Instead, use:
input.IndexOf("<script", 0, StringComparison.Ordinal);
Upvotes: 5
Reputation: 548
I would recommend using Regex for this; it offers a significant performance improvement because the expression is compiled only once. IndexOf, by contrast, is essentially a loop that runs on a per-character basis, which probably means you have 3 "loops" within your main for loop. Of course, IndexOf won't be as slow as a regular loop, but the time still grows as the input size grows. Regex has built-in functions that return the number and positions of occurrences of each pattern you define.
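For example, a rough sketch of that Regex approach (the pattern, options, and names here are my own illustration, not code from the question):

using System.Text.RegularExpressions;

// Compiled once and reused; Singleline lets '.' match across line breaks inside a block.
static readonly Regex ScriptBlock = new Regex(
    "<script.*?</script>",
    RegexOptions.IgnoreCase | RegexOptions.Singleline | RegexOptions.Compiled);

static void FindScriptBlocks(string input)
{
    foreach (Match m in ScriptBlock.Matches(input))
    {
        Console.WriteLine("block at {0}, length {1}", m.Index, m.Length);
    }
}

Match.Index and Match.Length give the position and extent of each block, which replaces the manual _exclStart/_exclStop bookkeeping in the question's code.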
Edit: this might shed some more light on the performance of IndexOf: IndexOf Perf
Upvotes: 3