Reputation: 376
I have this simple loop:
int[] array = new int[100000000];
int sum = 0;
for (int i = 0; i < array.Length; i++)
    sum += array[i];
I compared its performance with the C++ version. I thought the performance should be about the same, because it is very simple code and the range checks are omitted in that case. But it turned out that the C++ version is almost three times faster. So I implemented a C# unsafe version, but the performance was even worse. ReSharper suggests converting the loop to a LINQ expression like this:
sum = array.Sum();
That code is many times slower than the original loop in C#.
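For reference, the unsafe version looked roughly like this (a sketch only; the exact code may have differed slightly):
int sum = 0;
unsafe
{
    // pointer indexing instead of array indexing; requires /unsafe
    fixed (int* ptr = array)
    {
        for (int i = 0; i < array.Length; i++)
            sum += ptr[i];
    }
}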
Could someone tell me if there is something more I can do to improve the performance of this loop (without compiling it as a 64 bit version, which is two times faster)?
All tests were made on a 32 bit Release build and run without the debugger.
Edit: Small correction: the 64 bit version is two times faster with doubles, not ints.
Upvotes: 10
Views: 21668
Reputation: 5602
An easy and sometimes significant C# for loop optimization that is often overlooked is switching the loop counter variable type from int to uint. This leads to about a 12% speedup on average for your standard i++ (increment) loops with millions of iterations. If your loop iterates less than that, it's probably not going to change performance much.
Note that arrays can be indexed by uint without casting to int, so you don't lose any of the benefits when indexing inside the loop. The only common reasons not to use this technique are if you need negative loop counter values, or if the loop counter variable needs to be cast to int for other function calls, etc. inside the loop. As soon as you need to cast, it's probably not worth it.
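For illustration, here is the loop from the question with a uint counter (a sketch; not the code the 12% figure was measured on):
int[] array = new int[100000000];
int sum = 0;
// Same summation loop, but with a uint counter; the array can be
// indexed by the uint directly, so no cast back to int is needed.
for (uint i = 0; i < array.Length; i++)
    sum += array[i];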
Upvotes: 0
Reputation: 123
Using immediate operands will improve the performance to some extent:
int[] array = new int[100000000];
int sum = 0;
for (int i = 0; i < array.Length; i++)
    sum += array[i];
The above code needs to access two memory locations on every iteration, i.e. int i and array.Length.
Instead, use:
int[] array = new int[100000000];
int sum = 0;
int arrayLength = array.Length;
for (int i = arrayLength - 1; i >= 0; i--)
    sum += array[i];
Upvotes: 0
Reputation: 10323
First, a couple of general remarks about micro benchmarks like this: the Parallel.ForEach loop below contains an anonymous delegate that is only JITted when it is first called, so the JIT time is included in the timing the first time the benchmark is run (which is why the first Parallel.ForEach timing below is slower than the second).
There are four basic techniques for speeding up the code (if we keep it pure CLR): moving the loop into a separate method, unrolling it, doing both at once, and parallelizing it with Parallel.ForEach; the timings below compare all of them.
Here is the parallel code:
int sum = 0;
var syncObj = new object();
Parallel.ForEach(Partitioner.Create(0, array.Length),
    // per-partition local state starts at 0
    () => 0,
    (src, state, partialSum) => {
        int end = src.Item2;
        for (int i = src.Item1; i < end; i++)
            partialSum += array[i];
        return partialSum;
    },
    // merge each partition's partial sum under a lock
    partialSum => { lock (syncObj) { sum += partialSum; } });
The Partitioner class lives in the System.Collections.Concurrent namespace.
On my machine (i7 950, 8 logical cores), the timings I got were:
For loop: 196.786 ms
For loop (separate method): 72.319 ms
Unrolled for loop: 196.167 ms
Unrolled for loop (separate method): 67.961 ms
Parallel.ForEach (1st time): 48.243 ms
Parallel.ForEach (2nd time): 26.356 ms
There was no significant difference between 32 bit and 64 bit code.
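The "separate method" rows above refer to hoisting the loop into its own method; a minimal sketch of that variant (the method name here is illustrative, not from the measured code):
static int SumArray(int[] array)
{
    // Keeping the hot loop in its own method lets the JIT compile it
    // in isolation, which is what the separate-method timings measure.
    int sum = 0;
    for (int i = 0; i < array.Length; i++)
        sum += array[i];
    return sum;
}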
Upvotes: 7
Reputation: 881
I've added the following to @Ela's answer:
sum = 0;
watch.Restart();
var _lock = new object();
var stepsize = array.Length / 16;
Parallel.For(0, 16,
    (x, y) =>
    {
        var sumPartial = 0;
        for (var i = x * stepsize; i != (x + 1) * stepsize; ++i)
            sumPartial += array[i];
        lock (_lock)
            sum += sumPartial;
    });
Console.Write("Parallel.For:" + watch.ElapsedMilliseconds + " ms, result:" + sum);
and then printed the results so you get reference values:
for loop:893ms, result:100000000
linq sum:1535ms, result:100000000
for loop fixed:720ms, result:100000000
foreach sum:772ms, result:100000000
Parallel.For:195 ms, result:100000000
As you can see, waaay faster :)
For the stepsize, I tried array.Length / 8, array.Length / 16, and array.Length / 32 (I have an i7 CPU: 4 cores * 2 threads = 8 threads simultaneously) and they were all pretty much the same, so it's your choice.
Edit: I also tried stepsize = array.Length / 100, which lands somewhere around 250 ms, so a little bit slower.
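If you don't want to hard-code the partition count, one variant (a suggestion, not something measured above) is to derive it from the machine, reusing sum and _lock from the snippet above:
// Hypothetical variant: derive the partition count from the number of
// logical cores; assumes array.Length divides evenly, as it does here.
int partitions = Environment.ProcessorCount;
int step = array.Length / partitions;
Parallel.For(0, partitions, x =>
{
    var sumPartial = 0;
    for (var i = x * step; i != (x + 1) * step; ++i)
        sumPartial += array[i];
    lock (_lock)
        sum += sumPartial;
});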
Upvotes: 0
Reputation: 171236
Unroll the loop 2-8 times. Measure which one is best. The .NET JIT optimizes poorly, so you have to do some of its work.
You'll probably have to add unsafe as well, because the JIT will now be unable to optimize out the array bounds checks.
You can also try to aggregate into multiple sum variables:
int sum1 = 0, sum2 = 0;
for (int i = 0; i < array.Length; i += 2) {
    sum1 += array[i + 0];
    sum2 += array[i + 1];
}
That might increase instruction-level parallelism because all add instructions are now independent.
The i+0 is optimized to i automatically.
I tested it and it shaved off about 30%.
The timings are stable when repeated. Code:
Process.GetCurrentProcess().PriorityClass = ProcessPriorityClass.High;
var watch = new Stopwatch();
int[] array = new int[500000000];
for (int i = 0; i < array.Length; i++)
{
    array[i] = 1;
}
// warmup so the first timed iteration does not include JIT time
{
    watch.Restart();
    int sum = 0;
    for (int i = 0; i < array.Length; i++)
        sum += array[i];
}
for (int i2 = 0; i2 < 5; i2++)
{
    // plain for loop
    {
        watch.Restart();
        int sum = 0;
        for (int i = 0; i < array.Length; i++)
            sum += array[i];
        Console.WriteLine("for loop:" + watch.ElapsedMilliseconds + "ms, result:" + sum);
    }
    // unsafe pointer loop (requires an unsafe context)
    {
        watch.Restart();
        fixed (int* ptr = array)
        {
            int sum = 0;
            var length = array.Length;
            for (int i = 0; i < length; i++)
                sum += ptr[i];
            Console.WriteLine("unsafe for loop:" + watch.ElapsedMilliseconds + "ms, result:" + sum);
        }
    }
    // unsafe loop unrolled 4x with independent partial sums
    {
        watch.Restart();
        fixed (int* ptr = array)
        {
            int sum1 = 0;
            int sum2 = 0;
            int sum3 = 0;
            int sum4 = 0;
            var length = array.Length;
            for (int i = 0; i < length; i += 4)
            {
                sum1 += ptr[i + 0];
                sum2 += ptr[i + 1];
                sum3 += ptr[i + 2];
                sum4 += ptr[i + 3];
            }
            Console.WriteLine("unrolled for loop:" + watch.ElapsedMilliseconds + "ms, result:" + (sum1 + sum2 + sum3 + sum4));
        }
    }
    Console.WriteLine("===");
}
Further playing around, it turns out that multiple aggregation variables do nothing. Unrolling the loop made a major improvement, though. Unsafe did nothing (except in the unrolled case, where it is pretty much required). Unrolling 2 times is as good as 4 times.
Running this on a Core i7.
Upvotes: 10
Reputation: 13410
var watch = new Stopwatch();
int[] array = new int[100000000];
for (int i = 0; i < array.Length; i++)
{
    array[i] = 1;
}
watch.Restart();
int sum = 0;
for (int i = 0; i < array.Length; i++)
    sum += array[i];
Console.WriteLine("for loop:" + watch.ElapsedMilliseconds + "ms, result:" + sum);
sum = 0;
watch.Restart();
sum = array.Sum();
Console.WriteLine("linq sum:" + watch.ElapsedMilliseconds + "ms, result:" + sum);
sum = 0;
watch.Restart();
int length = array.Length;
for (int i = 0; i < length; i++)
    sum += array[i];
Console.WriteLine("for loop fixed:" + watch.ElapsedMilliseconds + "ms, result:" + sum);
sum = 0;
watch.Restart();
foreach (int i in array)
{
    sum += i;
}
Console.WriteLine("foreach sum:" + watch.ElapsedMilliseconds + "ms, result:" + sum);
sum = 0;
watch.Restart();
sum = array.AsParallel().Sum();
Console.WriteLine("linq parallel sum:" + watch.ElapsedMilliseconds + "ms, result:" + sum);
LINQ parallel seems to be fastest, on my machine at least.
Hoisting the length into a local ("for loop fixed") doesn't matter much, but improves things by ~10%.
There is not much more you can actually do; unmanaged C code will always be faster for this.
Results on my PC are:
for loop: 241ms, result:100000000
linq sum: 559ms, result:100000000
for loop fixed:237ms, result:100000000
foreach sum: 295ms, result:100000000
linq parallel: 205ms, result:100000000
Upvotes: 15