GG.

Reputation: 2951

Which one will be faster?

Just calculating the sum of two arrays, with a slight modification in the code:

#include <stdio.h>

int main()
{
    int a[10000]={0};   //initialize something
    int b[10000]={0};   //initialize something

    int sumA=0, sumB=0;

    for(int i=0; i<10000; i++)
    {
        sumA += a[i];
        sumB += b[i];
    }
    printf("%d %d",sumA,sumB);
    return 0;
}

OR

#include <stdio.h>

int main()
{
    int a[10000]={0};   //initialize something
    int b[10000]={0};   //initialize something

    int sumA=0, sumB=0;

    for(int i=0; i<10000; i++)
    {
        sumA += a[i];
    }

    for(int i=0; i<10000; i++)
    {
        sumB += b[i];
    }
    printf("%d %d",sumA,sumB);
    return 0;
}

Which code will be faster?

Upvotes: 1

Views: 651

Answers (10)

Skizz

Reputation: 71090

Here's some code with timing, built using VS2005:

#include <windows.h>
#include <iostream>
using namespace std;
int main ()
{
  LARGE_INTEGER
    start,
    middle,
    end;

  const int
    count = 1000000;

  int
    *a = new int [count] (),   // value-initialised to zero; reading indeterminate values would be undefined behaviour
    *b = new int [count] (),
    *c = new int [count] (),
    *d = new int [count] (),
    suma = 0,
    sumb = 0,
    sumc = 0,
    sumd = 0;

  QueryPerformanceCounter (&start);
  for (int i = 0 ; i < count ; ++i)
  {
    suma += a [i];
    sumb += b [i];
  }
  QueryPerformanceCounter (&middle);
  for (int i = 0 ; i < count ; ++i)
  {
    sumc += c [i];
  }
  for (int i = 0 ; i < count ; ++i)
  {
    sumd += d [i];
  }
  QueryPerformanceCounter (&end);
  cout << "Time taken = " << (middle.QuadPart - start.QuadPart) << endl;
  cout << "Time taken = " << (end.QuadPart - middle.QuadPart) << endl;
  cout << "Done." << endl << suma << sumb << sumc << sumd;
  return 0;
}

Running this, the latter version is usually faster.

I tried writing some assembler to beat the second loop but my attempts were usually slower. So I decided to see what the compiler had generated. Here's the optimised assembler produced for the main summation loop in the second version:

00401110  mov         edx,dword ptr [eax-0Ch] 
00401113  add         edx,dword ptr [eax-8] 
00401116  add         eax,14h 
00401119  add         edx,dword ptr [eax-18h] 
0040111C  add         edx,dword ptr [eax-10h] 
0040111F  add         edx,dword ptr [eax-14h] 
00401122  add         ebx,edx 
00401124  sub         ecx,1 
00401127  jne         main+110h (401110h) 

Here's the register usage:

  • eax = used to index the array
  • ebx = the grand total
  • ecx = loop counter
  • edx = sum of the five integers accessed in one iteration of the loop

There are a few interesting things here:

  • The compiler has unrolled the loop five times.
  • The order of memory access is not contiguous.
  • It updates the array index in the middle of the loop.
  • It sums five integers then adds that to the grand total.

To really understand why this is fast, you'd need to use Intel's VTune performance analyser to see where the CPU and memory stalls are as this code is quite counter-intuitive.

Upvotes: 9

Rik Heywood

Reputation: 13972

There is only one way to know, and that is to test and measure. You need to work out where your bottleneck is (CPU, memory bandwidth, etc.).

The size of the data in your array (ints in your example) affects the result, as it has an impact on the use of the processor cache. Often you will find example 2 is faster, which basically means your memory bandwidth is the limiting factor (example 2 accesses memory in a more efficient way).

Upvotes: 25

UncleBens

Reputation: 41351

For me (GCC -O3), measuring shows that the second version is faster by some 25%, which can be explained by a more efficient memory access pattern (all memory accesses are close to each other, not all over the place). Of course, you'll need to repeat the operation thousands of times before the difference becomes significant.

I also tried std::accumulate from the <numeric> header, which is the simple way to implement the second version; it was in turn a tiny amount faster than the second version (probably due to a more compiler-friendly looping mechanism?):

sumA = std::accumulate(a, a + 10000, 0);
sumB = std::accumulate(b, b + 10000, 0);

Upvotes: 4

Sadat

Reputation: 3481

If the arrays are large enough that the cache cannot hold both of them at once (as accessed in example 1), but can hold a single one (example 2), then the code of the first example will be slower than that of the second. Otherwise the code of the first example will be faster.

Upvotes: 2

Eric Bainville

Reputation: 9906

The first one will probably be faster. The memory access pattern will allow a (modern) CPU to manage the caches efficiently (prefetching), even while accessing two arrays.

Much faster, if your CPU allows it and the arrays are aligned: use SSE3 instructions to process four ints at a time.

Upvotes: 1

Kirill V. Lyadvinsky

Reputation: 99655

The C++ Standard says nothing about it; it is implementation dependent. It looks like you are trying to do premature optimization. It shouldn't bother you unless it is a bottleneck in your program. If it is, you should use a profiler to find out which one is faster on a given platform.

Until then, I'd prefer the first variant because it looks more readable (or, better, std::accumulate).

Upvotes: 2

Michael F

Reputation: 40849

If you meant a[i] instead of a[10000] (and b[i] respectively), and if your compiler performs loop distribution optimizations, the first one will be exactly the same as the second. If not, the second will perform slightly better.

If a[10000] is intended, then both loops will perform exactly the same (with trivial cache and flow optimizations).

Food for thought for some answers that were voted up: how many additions are performed in each version of the code?

Upvotes: 0

Jorge Córdoba

Reputation: 52123

In theory, due to cache optimizations the second one should be faster.

Caches are optimized to bring in and keep chunks of data, so that on first access you get a big chunk of the first array into the cache. In the first code, when you access the second array you might have to evict some of the first array's data from the cache, requiring more memory accesses.

In practice, both approaches will take more or less the same time, the first being a little better given the size of actual caches and the likelihood that no data is evicted from the cache at all.

Note: this sounds a lot like homework. In real life, for these sizes, the first option will be slightly faster, but this only applies to this concrete example; nested loops, bigger arrays or especially smaller cache sizes would have a significant impact on performance depending on the access order.

Upvotes: 7

Danny T.

Reputation: 1140

The first one will be faster because you run through the 10000 elements only once.

Upvotes: 2

user151323

Reputation:

The first one will be faster. The program will not need to run through the loop twice. It's not much work, but some cycles are lost incrementing the loop variable and checking the loop condition.

Upvotes: 5
