userx01233433

Reputation: 376

What can I do to make this loop run faster?

I have this simple loop:

int[] array = new int[100000000];
int sum = 0;
for (int i = 0; i < array.Length; i++)
    sum += array[i];

I compared its performance with its C++ version. I thought the performance should be nearly the same, because it is very simple code and the range checks are omitted in that case. But it turned out that the C++ version is almost three times faster. So I implemented an unsafe C# version, but the performance was even worse. ReSharper suggests converting the loop to a LINQ expression like this:

sum = array.Sum();

That code is many times slower than the original C# loop.

Could someone tell me if there is anything more I can do to improve the performance of this loop (without compiling it as a 64-bit version, which is two times faster)?

All tests were made on a 32-bit Release build and run without the debugger.

Edit: Small correction. The 64-bit version is two times faster with doubles, not ints.

Upvotes: 10

Views: 21668

Answers (6)

Special Sauce

Reputation: 5602

An easy and sometimes significant C# for-loop optimization that is often overlooked is switching the loop counter variable from int to uint. This leads to about a 12% speedup on average for standard i++ (increment) loops with millions of iterations. If your loop iterates fewer times than that, it probably won't change performance much.

Note that arrays can be indexed by uint without casting back to int, so you don't lose any of the benefits when indexing inside the loop. The only common reasons not to use this technique are if you need negative loop counter values, or if the loop counter variable needs to be cast to int for other function calls etc. inside the loop. As soon as you need to cast, it's probably not worth it.
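
For example, the loop from the question becomes something like this (a minimal sketch; the cast of array.Length to uint is safe because Length is never negative):

int[] array = new int[100000000];
int sum = 0;
// uint counter; the array indexer accepts uint directly,
// so no cast back to int is needed in the loop body
for (uint i = 0; i < (uint)array.Length; i++)
    sum += array[i];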

Upvotes: 0

Nakul Patekar

Reputation: 123

Using immediate operands will improve the performance to some extent. Consider the original loop:

int[] array = new int[100000000];
int sum = 0;
for (int i = 0; i < array.Length; i++)
    sum += array[i];

The above code needs to access two memory locations in its loop condition, i.e. i and array.Length.

Instead, use:

int[] array = new int[100000000];
int sum = 0;
int arrayLength = array.Length;
// Counting down compares i against the immediate constant 0
// instead of re-reading array.Length every iteration.
for (int i = arrayLength - 1; i >= 0; i--)
    sum += array[i];

Upvotes: 0

Jeffrey Sax

Reputation: 10323

First a couple of general remarks about micro benchmarks like this:

  • The timings here are so short that JIT time can be significant. This is important because a parallel ForEach loop contains an anonymous delegate that is only JITted when it is first called and so the JIT time is included in the timing the first time the benchmark is run.
  • The context of the code is important as well. The JITter can do a better job optimizing small functions. Isolating the benchmark code in its own function may have a significant impact on performance.

There are four basic techniques for speeding up the code (if we keep it pure CLR):

  1. Parallelize it. This is obvious.
  2. Unroll loops. This reduces the loop overhead by doing the comparison only once every 2 or more iterations.
  3. Use unsafe code. This is not much of a benefit in this case, because the main issue (the range checks on the array) is optimized away already.
  4. Allow the code to be optimized better. We can do this by putting the actual benchmark code in a separate method, as sketched right after this list.
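
As a minimal sketch of technique 4 (the method name is illustrative), the hot loop moves into its own small method:

static int SumArray(int[] array)
{
    // In a small method like this the JIT can do a better job,
    // e.g. eliminating the range check because i is provably
    // within array.Length.
    int sum = 0;
    for (int i = 0; i < array.Length; i++)
        sum += array[i];
    return sum;
}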

Here is the parallel code:

int sum = 0;
var syncObj = new object();
Parallel.ForEach(Partitioner.Create(0, array.Length),
    () => 0,                        // initial value of each partial sum
    (range, state, partialSum) => { // sum one partition
        int end = range.Item2;
        for (int i = range.Item1; i < end; i++)
            partialSum += array[i];
        return partialSum;
    },
    // merge the partial sums under a lock
    partialSum => { lock (syncObj) { sum += partialSum; } });

The Partitioner class lives in the System.Collections.Concurrent namespace.
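
For reference, the using directives the snippet relies on (Parallel itself comes from System.Threading.Tasks):

using System.Collections.Concurrent; // Partitioner
using System.Threading.Tasks;        // Parallel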

On my machine (i7 950, 8 logical cores), the timings I got were:

For loop: 196.786 ms
For loop (separate method): 72.319 ms
Unrolled for loop: 196.167 ms
Unrolled for loop (separate method): 67.961 ms
Parallel.Foreach (1st time): 48.243 ms
Parallel.Foreach (2nd time): 26.356 ms

There was no significant difference between 32 bit and 64 bit code.

Upvotes: 7

Nefarion

Reputation: 881

I've added the following to @Ela's answer:

sum = 0;
watch.Restart();
var _lock = new object();
// split the array into 16 equal slices (assumes Length is divisible by 16)
var stepsize = array.Length / 16;
Parallel.For(0, 16,
    (x, y) =>
    {
        // each task sums its own slice
        var sumPartial = 0;
        for (var i = x * stepsize; i != (x + 1) * stepsize; ++i)
            sumPartial += array[i];
        // merge the slice result under a lock
        lock (_lock)
            sum += sumPartial;
    });
Console.Write("Parallel.For:" + watch.ElapsedMilliseconds + " ms, result:" + sum);

and then printed the results so you get reference values:

for loop:893ms, result:100000000
linq sum:1535ms, result:100000000
for loop fixed:720ms, result:100000000
foreach sum:772ms, result:100000000
Parallel.For:195 ms, result:100000000

As you can see, waaay faster :) For the step size, I tried arr.Length / 8, arr.Length / 16, and arr.Length / 32 (I have an i7 CPU: 4 cores * 2 threads = 8 threads simultaneously) and they were all pretty much the same, so it's your choice.

Edit: I also tried stepsize = arr.Length / 100, which lands somewhere around 250 ms, so a little bit slower.

Upvotes: 0

usr

Reputation: 171236

Unroll the loop 2-8 times. Measure which one is best. The .NET JIT optimizes poorly, so you have to do some of its work.

You'll probably have to add unsafe as well because the JIT will now be unable to optimize out the array bounds checks.

You can also try to aggregate into multiple sum variables:

int sum1 = 0, sum2 = 0;
// Two independent accumulators; assumes array.Length is even.
for (int i = 0; i < array.Length; i += 2) {
    sum1 += array[i + 0];
    sum2 += array[i + 1];
}

That might increase instruction-level parallelism because all add instructions are now independent.

The i+0 is optimized to i automatically.


I tested it and it shaved off about 30%.

The timings are stable when repeated. Code (the pointer loops require an unsafe context and compiling with /unsafe):

Process.GetCurrentProcess().PriorityClass = ProcessPriorityClass.High;

var watch = new Stopwatch();

int[] array = new int[500000000];
for (int i = 0; i < array.Length; i++)
{
    array[i] = 1;
}

// warmup
{
    watch.Restart();
    int sum = 0;
    for (int i = 0; i < array.Length; i++)
        sum += array[i];
}

for (int i2 = 0; i2 < 5; i2++)
{
    {
        watch.Restart();
        int sum = 0;
        for (int i = 0; i < array.Length; i++)
            sum += array[i];
        Console.WriteLine("for loop:" + watch.ElapsedMilliseconds + "ms, result:" + sum);
    }

    {
        watch.Restart();
        fixed (int* ptr = array)
        {
            int sum = 0;
            var length = array.Length;
            for (int i = 0; i < length; i++)
                sum += ptr[i];
            Console.WriteLine("for loop:" + watch.ElapsedMilliseconds + "ms, result:" + sum);
        }
    }

    {
        watch.Restart();
        fixed (int* ptr = array)
        {
            int sum1 = 0;
            int sum2 = 0;
            int sum3 = 0;
            int sum4 = 0;
            var length = array.Length;
            for (int i = 0; i < length; i += 4)
            {
                sum1 += ptr[i + 0];
                sum2 += ptr[i + 1];
                sum3 += ptr[i + 2];
                sum4 += ptr[i + 3];
            }
            Console.WriteLine("for loop:" + watch.ElapsedMilliseconds + "ms, result:" + (sum1 + sum2 + sum3 + sum4));
        }
    }

    Console.WriteLine("===");
}

Playing around further, it turns out that multiple aggregation variables do nothing. Unrolling the loop made a major improvement, though. Unsafe did nothing (except in the unrolled case, where it is pretty much required). Unrolling 2 times is as good as 4; a sketch of the 2x version follows.
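
For reference, a 2x-unrolled pointer loop looks something like this (a sketch; it assumes the length is even, otherwise a scalar cleanup loop is needed for the last element):

fixed (int* ptr = array)
{
    int sum = 0;
    var length = array.Length;
    for (int i = 0; i < length; i += 2)
    {
        // two adds per iteration halve the loop-counter overhead
        sum += ptr[i + 0];
        sum += ptr[i + 1];
    }
}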

Running this on a Core i7.

Upvotes: 10

MichaC

Reputation: 13410

var watch = new Stopwatch();

int[] array = new int[100000000];
for (int i = 0; i < array.Length; i++)
{
    array[i] = 1;
}

watch.Restart();
int sum = 0;
for (int i = 0; i < array.Length; i++)
    sum += array[i];
Console.WriteLine("for loop:" + watch.ElapsedMilliseconds + "ms, result:" + sum);

sum = 0;
watch.Restart();
sum = array.Sum();
Console.WriteLine("linq sum:" + watch.ElapsedMilliseconds + "ms, result:" + sum);

sum = 0;
watch.Restart();
int length = array.Length;
for (int i = 0; i < length; i++)
    sum += array[i];
Console.WriteLine("for loop fixed:" + watch.ElapsedMilliseconds + "ms, result:" + sum);

sum = 0;
watch.Restart();
foreach (int i in array)
{
    sum += i;
}
Console.WriteLine("foreach sum:" + watch.ElapsedMilliseconds + "ms, result:" + sum);

sum = 0;
watch.Restart();
sum = array.AsParallel().Sum();
Console.WriteLine("linq parallel sum:" + watch.ElapsedMilliseconds + "ms, result:" + sum);

LINQ Parallel seems to be the fastest, on my machine at least.

Caching the length in a local doesn't matter much, but it improves things by ~10%.

There is not much more you can actually do; unmanaged C code will always be faster for this.

Results on my PC are:

for loop:      241ms, result:100000000
linq sum:      559ms, result:100000000
for loop fixed:237ms, result:100000000
foreach sum:   295ms, result:100000000
linq parallel: 205ms, result:100000000

Upvotes: 15
