Reputation: 3079
For a homework assignment I need to optimize a loop to run in under 7.5 seconds. I think I may have done this, because my code runs in about 4 seconds. However, my instructor told us that anything too far under 7.5 seconds is probably wrong, so I am worried that I am not doing this correctly. Here is the original code:
#include <stdio.h>
#include <stdlib.h>

#define N_TIMES 600000
#define ARRAY_SIZE 10000

int main (void)
{
    double *array = calloc(ARRAY_SIZE, sizeof(double));
    double sum = 0;
    int i;

    for (i = 0; i < N_TIMES; i++) {
        int j;
        for (j = 0; j < ARRAY_SIZE; j++) {
            sum += array[j];
        }
    }
    return 0;
}
Here is my optimization:
for (i = 0; i < N_TIMES; i++) {
    int j;
    for (j = 0; j < ARRAY_SIZE/2; j += 20) {
        sum += array[j] + array[j+1] + array[j+2] + array[j+3] + array[j+4] + array[j+5] + array[j+6] + array[j+7] + array[j+8] + array[j+9];
        sum1 += array[j+10] + array[j+11] + array[j+12] + array[j+13] + array[j+14] + array[j+15] + array[j+16] + array[j+17] + array[j+18] + array[j+19];
    }
}
sum += sum1;
Are these doing the same number of arithmetic operations? Did I change what the code computes, or am I just optimizing it well?
Upvotes: 0
Views: 712
Reputation: 142625
The code could be optimized in two ways: one is to improve the algorithm, the other is to improve it at the instruction level, i.e. make every operation run as fast as possible. Looking at your code, it seems you're going for the second one, and you're doing it quite rightly. One feature found in modern processors is "instruction pipelining", which has a few stages. The order of code execution is:
IF Instruction Fetch
ID Instruction Decode
EX Execution
Mem Memory access
WB Write Back
These stages can be done in parallel, i.e. while you're doing ID for one instruction, you can already do IF for the next one in advance. In the original version,

sum += array[j];

each iteration's IF has to wait for the previous operation to finish executing completely, which results in stalled CPU cycles. IF, ID, EX, Mem and WB each take 1 CPU cycle, so 5 CPU cycles are needed to complete the full instruction. But with loop unrolling,
sum += array[j]; // first op
sum += array[j+1]; // second op
sum += array[j+2];
sum += array[j+3];
sum += array[j+4]; // fifth op
in this implementation, while the first operation's ID is in progress, the IF for the second can happen in the same cycle, i.e. simultaneously. On the second CPU cycle you're doing ID of the first operation and IF of the second; on the third cycle you have IF of the third op, ID of the second and EX of the first. This exploits instruction-level parallelism and reduces the number of stalled CPU cycles.
Based on this technique, a typical way of optimizing a loop is to "unroll" it, i.e. loop unrolling; you can get a full schematic view and details of loop unrolling and instruction pipelining in this link.
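For reference, here is a minimal sketch of how the unrolled inner loop could look with two independent accumulators (this assumes ARRAY_SIZE is a multiple of 20 and that sum1 starts at zero; note the bound covers the whole array):
double sum = 0, sum1 = 0;              /* two independent accumulators */
for (i = 0; i < N_TIMES; i++) {
    int j;
    for (j = 0; j < ARRAY_SIZE; j += 20) {      /* 20 elements per iteration */
        sum  += array[j]    + array[j+1]  + array[j+2]  + array[j+3]  + array[j+4]
              + array[j+5]  + array[j+6]  + array[j+7]  + array[j+8]  + array[j+9];
        sum1 += array[j+10] + array[j+11] + array[j+12] + array[j+13] + array[j+14]
              + array[j+15] + array[j+16] + array[j+17] + array[j+18] + array[j+19];
    }
}
sum += sum1;                           /* combine the partial sums once at the end */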
To back up what I tried to explain, let's run a test. I compiled your code and created two executables with the two different loops, then used perf to get an idea of how things went. Here are the results:
Performance counter stats for './test':
17739.862565 task-clock # 1.000 CPUs utilized
183 context-switches # 0.010 K/sec
5 cpu-migrations # 0.000 K/sec
138 page-faults # 0.008 K/sec
===> 58,408,599,809 cycles # 3.293 GHz
===> 34,387,134,201 stalled-cycles-frontend # 58.87% frontend cycles idle
===> 4,229,714,038 stalled-cycles-backend # 7.24% backend cycles idle
72,056,092,464 instructions # 1.23 insns per cycle
# 0.48 stalled cycles per insn
6,011,271,479 branches # 338.857 M/sec
618,206 branch-misses # 0.01% of all branches
17.744254427 seconds time elapsed
and now with unroll-loop-test:
Performance counter stats for './unroll-loop-test':
2395.115499 task-clock # 1.000 CPUs utilized
22 context-switches # 0.009 K/sec
2 cpu-migrations # 0.001 K/sec
138 page-faults # 0.058 K/sec
====> 7,885,935,372 cycles # 3.293 GHz
====> 1,569,263,256 stalled-cycles-frontend # 19.90% frontend cycles idle
====> 50,629,264 stalled-cycles-backend # 0.64% backend cycles idle
24,911,629,893 instructions # 3.16 insns per cycle
# 0.06 stalled cycles per insn
153,158,495 branches # 63.946 M/sec
607,999 branch-misses # 0.40% of all branches
2.395806562 seconds time elapsed
Take a close look at the number of cycles executed: with the unrolled loop the stalled cycles are far fewer, so fewer CPU cycles are needed; without unrolling, the stalled cycles consume more CPU cycles and performance is poor. So yes, you're doing a nice optimization, and both versions execute the same number of arithmetic operations. But also remember that if you run this program on a multiprocessor system, another level of optimization would be to split the whole work into parts and assign each part to a CPU available on the system; that is known as "parallel programming". I hope my answer clarifies the concept.
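To illustrate that last point only (not something your assignment asks for), here is a rough sketch using OpenMP, assuming your compiler supports it and you build with -fopenmp:
#include <stdio.h>
#include <stdlib.h>

#define N_TIMES 600000
#define ARRAY_SIZE 10000

int main (void)
{
    double *array = calloc(ARRAY_SIZE, sizeof(double));
    double sum = 0;
    int i;

    /* split the outer iterations across the available CPUs;
       the reduction clause combines the per-thread partial sums */
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < N_TIMES; i++) {
        int j;
        for (j = 0; j < ARRAY_SIZE; j++) {
            sum += array[j];
        }
    }

    printf("%f\n", sum);
    free(array);
    return 0;
}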
Upvotes: 1
Reputation: 667
Calloc sets all elements in the array to zero (actually all bits are set to zero). So you are really adding zero a bunch of times.
So let me run through some ways to potentially go faster, beyond what you are doing (which is good: you are avoiding comparisons, though if your array size weren't a multiple of 20 or whatever, you would have issues).
It may or may not be slightly faster to initialize your array statically and set the values to zero:
double array[ARRAY_SIZE] = {0};
Technically {} works with some compilers (it is standard only as of C23), but {0} is the portable and more explicit choice.
The inner for loop declares and reinitializes j to 0 on every pass of the outer loop. Declare int j outside of both loops and you potentially save N_TIMES of those operations.
For example, Carl Friedrich Gauss supposedly figured out as a child that if your sequence is 1, 2, 3, 4, ..., n (where n is the last number), then the sum is (n * (n + 1)) / 2. If n is 4, then 1 + 2 + 3 + 4 = 10, and (4 * 5) / 2 also equals ten.
There are other closed forms, like the sum of consecutive squares, i.e. (1^2 + 2^2 + 3^2 + 4^2 + ... + n^2). Read https://en.wikipedia.org/wiki/Square_pyramidal_number for more on that.
Anyway, my point is that understanding the math is important to optimization.
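As a tiny illustration (with made-up values, nothing to do with the assignment itself), the closed form replaces a whole loop with a couple of arithmetic operations:
/* summing 1..n the slow way and with Gauss's formula */
long long n = 10000;
long long loop_sum = 0;
long long k;
for (k = 1; k <= n; k++) {
    loop_sum += k;                        /* n additions */
}
long long closed_form = n * (n + 1) / 2;  /* constant number of operations */
/* loop_sum and closed_form are both 50005000 */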
You could potentially do something like:
int i, j;
double d = 0;

for (i = 0; i < ARRAY_SIZE; i++) {
    if (array[i] == 0)
        continue;               /* skip the common zero case */
    for (j = 0; j < N_TIMES; j++) {
        d += array[i];
    }
}
That's untested, because I doubt it would be acceptable, but the pattern (skipping to the next iteration of the loop in a common case to avoid subsequent unnecessary instructions) is a common optimization practice.
With an unoptimized compiler, using a pointer may be faster than an index. I.e., you could loop with:
double *arrayend = array + (ARRAY_SIZE - 1);
double *valuep;

for (valuep = array; valuep <= arrayend; valuep++) {
    // inner stuff
}
!= may be faster than <, though don't use equality comparisons on non-integer (floating-point) values.
Using unsigned numbers may be faster, though probably not in your case. See Signed vs Unsigned operations in C.
Integers are probably faster than doubles, but may not be big enough for the actual math.
Edit: one more. If you know the cache sizes of the system, you can potentially optimize for that.
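Putting the pointer and != suggestions together, the inner loop might look something like this (untested, and a modern optimizing compiler will likely produce the same code either way):
double *end = array + ARRAY_SIZE;     /* one past the last element */
double *p;

for (p = array; p != end; p++) {      /* pointer walk, != comparison */
    sum += *p;
}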
Upvotes: 0
Reputation: 44298
You can do:
double *array = calloc(ARRAY_SIZE, sizeof(double));
double sum = 0;
int i;
int j;

for (j = 0; j < ARRAY_SIZE; j++) {
    sum += array[j];
}
sum *= N_TIMES;

return 0;
But this reduces the number of operations. To keep the same number of operations while still getting cache hits (and even register hits), you can do this:
int main (void)
{
    double *array = calloc(ARRAY_SIZE, sizeof(double));
    double sum = 0;
    int i;
    int j;
    double d;

    for (j = 0; j < ARRAY_SIZE; j++) {
        d = array[j];
        for (i = 0; i < N_TIMES; i++) {
            sum += d;
        }
    }
    return 0;
}
Upvotes: 0
Reputation: 179422
Your optimizations are not correct:
for (j = 0; j < ARRAY_SIZE/2; j += 20) {
You now loop half as many times in the inner loop as you should.
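A sketch of the corrected loop header, keeping the unroll factor of 20 (ARRAY_SIZE here is a multiple of 20, so no remainder handling is needed):
for (j = 0; j < ARRAY_SIZE; j += 20) {   /* full array, not ARRAY_SIZE/2 */
    /* ... same unrolled body as before ... */
}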
Upvotes: 2