feelfree
feelfree

Reputation: 11753

Can reducing loop times in C++ codes help increase the speed?

I give the following example to illustrate my question:

void fun(int i, float *pt)
{
    // do something based on i
    std::cout<<*(pt+i)<<std::endl;
}
const unsigned int LOOP = 2000000007;

void fun_without_optmization()
{
    float *example;
    example = new float [LOOP];
    for(unsigned int i=0; i<LOOP; i++)
    {
        fun(i,example);            
    }
    delete []example;
}
void fun_with_optimization()
{
    float *example;
    example = new float [LOOP];
    unsigned int unit_loop = LOOP/10;
    unsigned int left_loop = LOOP%10;
    pt = example;
    for(unsigend int i=0; i<unit_loop; i++)
    {
        fun(0,pt); 
        fun(1,pt);  
        fun(2,pt);
        fun(3,pt);  
        fun(4,pt);
        fun(5,pt);
        fun(6,pt);  
        fun(7,pt);
        fun(8,pt);  
        fun(9,pt);
        pt=pt+10;
    }
    delete []example;
}

As far as I understand, function fun_without_optimization() and function fun_with_optimization() should perform the same. The only argument why the second function is better than the first is that the pointer calculation in fun becomes simple. Any other arguments why the second function is better?

Upvotes: 1

Views: 891

Answers (4)

Pandrei
Pandrei

Reputation: 4951

This is an optimization technique used for parallel architectures (architectures that support VLIW instructions). Depending on the number DALU (most common 4) and ALU(most common 2) units the architecture supports, and the level of "parallelization" the code supports, multiple instructions can be executes in one cycle.

So this code:

for (int i=0; i<n;i++) //n multiple of 4, for simplicity
   a+=temp; //just a random instruction

Will actually execute faster on a parallel architecture if rewritten like:

for (int i=0;i<n ;i+=4)
{
    temp0 = temp0 +temp1; //reads and additions can be executed in parallel
    temp1 = temp2 +temp3;
    a=temp0+temp1+a;
}

There is a limit to how much you can parallelize your code, a limit imposed by the physical ALUs/DALUs the CPU has. That's why it's important to know your architecture before you attempt to (properly) optimize your code.

It does not stop here: the code you want to optimize has to be a continuous block of code, meaning no jumps ( no function calls, no chance of flow instructions), for maximum efficiency.

Writing your code, like:

    for(unsigend int i=0; i<unit_loop; i++)
    {
        fun(0,pt); 
        fun(1,pt);  
        fun(2,pt);
        fun(3,pt);  
        fun(4,pt);
        fun(5,pt);
        fun(6,pt);  
        fun(7,pt);
        fun(8,pt);  
        fun(9,pt);
        pt=pt+10;
    } 

Wold not do much, unless the compiler inlines the function calls; and it looks like to many instructions anyway...

On a different note: while it's true that you ALWAYS have to work with the compiler when optimizing your code, you should NEVER rely only on it when you what to get the maximum optimization out of your code. Remember, the compiler handles 'the general case' while you are likely interested in a particular situation - that's why some compiles have special directives to help with the optimization process.

Upvotes: 1

Vlad Feinstein
Vlad Feinstein

Reputation: 11311

Re: "Any other arguments why the second function is better?" - would you accept the answer explaining why it is NOT better?

  1. Manually unrolling a loop is error-prone, as is clearly illustrated by your code: you forgot to process the tail left_loop.
  2. For at least a couple of decades compiler does this optimization for you.
  3. How do you know the optimal number of iteration to put in that unrolled loop? Do you target a specific cache size and calculate the length of assembly instructions in bytes? The compiler might.
  4. Your messing with the otherwise clean loop can prevent other optimizations, like the use of SIMD.

The bottom line is: if you know something that your compiler doesn't (specific pattern of the run-time data, details of the targeted execution environment, etc.), and you know what you are doing - you can try manual loop unrolling. But even then - profile.

Upvotes: 5

Peter - Reinstate Monica
Peter - Reinstate Monica

Reputation: 16026

Unrolling a loop in which I/O is performed is like moving the landing strip for a B747 from London an inch eastward in JFK.

Upvotes: 6

Codor
Codor

Reputation: 17605

The technique you describe is called loop unrolling; potentially this increases performance, as the time for evaluation of the control structures (update of te loop variable and checking the termination condition) becomes smaller. However, decent compilers can do this for you and maintainability of the code decreases if done manually.

Upvotes: 3

Related Questions