user787267

Reputation: 2990

C compilers and loop optimisation

I don't have a lot of experience with how compilers actually optimise, or with what the difference is between the various levels (-O2 vs -O3 for gcc, for example). As such, I'm unsure whether the following two snippets are equivalent for an arbitrary compiler:

for(i=0;i<10;++i){
variable1*variable2*gridpoint[i];
}

and

variable3=variable1*variable2;
for(i=0;i<10;++i){
variable3*gridpoint[i];
}

From a processing-time point of view it would make sense to compute the product of variable1 and variable2 only once, as they don't change in the loop. This requires extra memory, however, and I'm not sure how strongly an optimiser factors that overhead in. The first expression is the easiest to read if you have an equation from a paper/book and want to translate it into something computer-readable, but the second might be the fastest, especially for more complicated equations with a lot of unchanged variables inside the loop (I have some pretty nasty non-linear differential equations which I would like to keep human-readable in the code). Does any of this change if I declare my variables as constants? I hope my question makes sense for an arbitrary compiler, since I use the gcc, Intel and Portland compilers.

Upvotes: 8

Views: 983

Answers (3)

abelenky

Reputation: 64672

If I were a compiler, I would recognize that both of those loops have no left-hand operands and no side effects at all (other than setting i to 10), so I'd just optimize the loops out completely.

I'm not saying this actually happens; it just looks like it could happen from the code you provided.
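
For the loop to survive, its result has to be observable somehow. A hypothetical variant (my sketch, not code from the question):

void scale_grid(double result[10], const double gridpoint[10],
                double variable1, double variable2)
{
    /* Storing each product makes the loop's work observable,
       so the compiler can no longer delete it outright. */
    for (int i = 0; i < 10; ++i)
        result[i] = variable1 * variable2 * gridpoint[i];
}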

Upvotes: 0

user405725

Reputation:

It is hard to answer this question adequately for an arbitrary compiler. What can be done with this code depends not only on the compiler, but also on the target architecture. I'll try to explain what a production compiler with good capabilities could do to it.

From a processing-time point of view it would make sense to only compute the product of variable1 and variable2 once as they don't change in the loop.

You are right. And as Mr. Cat has pointed out, this is called common subexpression elimination. Thus, the compiler may generate code that computes the expression only once (or may even compute it at compile time if both operands are known to be constant).

A decent compiler may also perform subexpression elimination on function calls if it can determine that the functions have no side effects. GCC, for example, can analyze a function when its body is available, and there are also pure and const attributes that can be used to explicitly mark functions that should be subject to this optimization (see Function Attributes).
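
For instance, here is a minimal sketch of my own (the functions and their names are mine, but the const attribute is a real GCC/Clang extension) showing how a marked function lets the compiler hoist a call out of a loop:

/* __attribute__((const)) promises the result depends only on the
   arguments: no globals, no memory reads. */
__attribute__((const))
static double scale_factor(double a, double b)
{
    return a * b;
}

double sum_scaled(const double gridpoint[10], double a, double b)
{
    double s = 0.0;
    for (int i = 0; i < 10; ++i)
        s += scale_factor(a, b) * gridpoint[i]; /* call may be evaluated once */
    return s;
}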

Given that there is no side effect and the compiler is able to determine that (in your example, nothing stands in the way), the two snippets are equivalent in this regard (I've checked with clang :-)).

This requires extra memory however, and I'm not sure how strongly an optimiser factors this overhead in.

In fact, this doesn't require any extra memory. The multiplication is done in processor registers and the result is stored in a register as well. It is a matter of eliminating a lot of code and using a single register to hold the result, which is always a win (and most certainly makes life easier when it comes to register allocation, especially in a loop). So if this optimization can be done, it will be done at no extra cost.

The first expression is the easiest to read...

Both GCC and Clang will perform this optimization. I am not sure about other compilers, though, so you will have to check for yourself. But it is hard to imagine any good compiler that does not do subexpression elimination.
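
One way to check for yourself is to inspect the assembly your compiler emits, e.g. something like:

$ gcc -O2 -S -o - test.c

and see whether the multiplication ends up inside or outside the loop body.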

Does any of this change if I declare my variables as constants?

It may. This is called a constant expression: an expression that contains only constants. A constant expression can be evaluated during compilation rather than at run time. So, for example, if you multiply A, B and C, where A and B are both constants, the compiler will pre-compute the A*B expression and only multiply C by that pre-computed value. Compilers may also do this even with non-constant values if they can determine their values at compile time and be sure that they are not changed. For example:

$ cat test.c
inline int foo(int a, int b)
{
    return a * b;
}

int main() {
    int a;
    int b;
    a = 1;
    b = 2;
    return foo(a, b);
}
$ clang -Wall -pedantic -O4 -o test ./test.c
$ otool -tv ./test
./test:
(__TEXT,__text) section
_main:
0000000100000f70    movl    $0x00000002,%eax
0000000100000f75    ret

There are also other optimizations that can take place for the above snippets. Here are some that come to mind:

The first and most obvious is loop unrolling. Since the number of iterations is known at compile time, the compiler may decide to unroll the loop. Whether this optimization is applied depends on the architecture (some CPUs can "lock onto" a small loop and execute it faster than the unrolled version; keeping the loop also makes the code more cache-friendly by using less space, avoiding extra µOP fusion stages, etc.).
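
To illustrate the idea (a hand-written sketch of my own, not actual compiler output, and assuming the loop accumulates its products into a sum, which the snippets in the question don't actually do):

double sum_unrolled(double variable3, const double gridpoint[10])
{
    double acc0 = 0.0, acc1 = 0.0;
    int i;
    /* Unrolled by a factor of two: 5 loop tests instead of 10, and
       two independent accumulators expose instruction-level parallelism. */
    for (i = 0; i < 10; i += 2) {
        acc0 += variable3 * gridpoint[i];
        acc1 += variable3 * gridpoint[i + 1];
    }
    return acc0 + acc1;
}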

The second optimization, which can literally speed things up by as much as 50 times, is using SIMD instructions (SSE, AVX, etc.). For example, GCC is very good at this (the Intel compiler must be too, if not better). I have verified that the following function:

uint8_t dumb_checksum(const uint8_t *p, size_t size)
{
    uint8_t s = 0;              /* running byte sum, wraps modulo 256 */
    size_t i;
    for (i = 0; i < size; ++i)
        s = (uint8_t)(s + p[i]);
    return s;
}

... is transformed into a loop in which each step sums 16 values at once (i.e. as with _mm_add_epi8), with additional code handling alignment and a leftover (<16) iteration count. Clang, however, totally failed at this the last time I checked. So GCC may reduce your loop that way too, even when the number of iterations is not known in advance.
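
To give an idea of the shape of that transformation, here is my hand-written sketch using SSE2 intrinsics (not GCC's actual output; the function name is mine):

#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

uint8_t dumb_checksum_sse2(const uint8_t *p, size_t size)
{
    __m128i acc = _mm_setzero_si128();
    size_t i = 0;

    /* Main loop: add 16 bytes at a time into 16 independent lanes. */
    for (; i + 16 <= size; i += 16)
        acc = _mm_add_epi8(acc, _mm_loadu_si128((const __m128i *)(p + i)));

    /* Fold the 16 lanes into one byte; addition mod 256 is associative
       and commutative, so the order does not matter. */
    uint8_t lanes[16], s = 0;
    _mm_storeu_si128((__m128i *)lanes, acc);
    for (int j = 0; j < 16; ++j)
        s = (uint8_t)(s + lanes[j]);

    /* Tail: the remaining (<16) bytes. */
    for (; i < size; ++i)
        s = (uint8_t)(s + p[i]);
    return s;
}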

And if I may, I'd like to suggest that you do not optimize your code unless you have found it to be a bottleneck. Otherwise you may waste a hell of a lot of time on false and premature optimization.

I hope this answers your questions. Good Luck!

Upvotes: 4

hexist

Reputation: 5240

Yes, you can count on compilers to do a good job at subexpression elimination, even across loops. This can result in a slight increase in memory usage; however, all of that will be weighed by any decent compiler, and it's pretty much always a win to perform subexpression elimination (since the memory we're talking about is registers and L1 cache).

Here are some quick tests to "prove" it to yourself, too. The results indicate that you should basically not try to outsmart the compiler by doing manual subexpression elimination; just code naturally and let the compiler do what it's good at (which is things like figuring out which expressions really should be eliminated and which shouldn't, given the target architecture and the surrounding code).

Later, if you are unsatisfied with the performance of your code, take a profiler to it and see which statements and expressions are eating up the most time, then try to figure out whether you can reorganize the code to help the compiler out. But I'd say that the vast majority of the time it won't be simple things like this; it will be things that reduce cache stalls (i.e. organizing your data better), eliminate redundant inter-procedural calculations, and so on.

(FTR, the use of random values in the following code just ensures that the compiler can't get too zealous about variable elimination and loop unrolling.)

prog1:

#include <stdlib.h>
#include <time.h>

int main () {
    srandom(time(NULL));
    int i, ret = 0, a = random(), b = random(), values[10];
    int loop_end = random() % 5 + 1000000000;
    for (i=0; i < 10; ++i) { values[i] = random(); }

    for (i = 0; i < loop_end; ++i) {
        ret += a * b * values[i % 10];
    }

    return ret;
}

prog2:

#include <stdlib.h>
#include <time.h>

int main () {
    srandom(time(NULL));
    int i, ret = 0, a = random(), b = random(), values[10];
    int loop_end = random() % 5 + 1000000000;
    for (i=0; i < 10; ++i) { values[i] = random(); }

    int c = a * b;
    for (i = 0; i < loop_end; ++i) {
        ret += c * values[i % 10];
    }

    return ret;
}

And here are the results:

> gcc -O2 prog1.c -o prog1; time ./prog1  
./prog1  1.62s user 0.00s system 99% cpu 1.630 total

> gcc -O2 prog2.c -o prog2; time ./prog2
./prog2  1.63s user 0.00s system 99% cpu 1.636 total

(This is measuring wall time, so don't pay attention to the 0.01-second difference; running it a few times, they both land in the 1.62-1.63 second range, so they're the same speed.)

Interestingly enough, prog1 was faster when compiled without optimization:

> gcc -O0 prog1.c -o prog1; time ./prog1  
./prog1  2.83s user 0.00s system 99% cpu 2.846 total

> gcc -O0 prog2.c -o prog2; time ./prog2 
./prog2  2.93s user 0.00s system 99% cpu 2.946 total

Also interesting: compiling with -O1 provided the best performance.

> gcc -O1 prog1.c -o prog1; time ./prog1 
./prog1  1.57s user 0.00s system 99% cpu 1.579 total

> gcc -O1 prog2.c -o prog2; time ./prog2
./prog2  1.56s user 0.00s system 99% cpu 1.563 total

GCC and Intel are great compilers and are pretty smart about handling things like this. I don't have any experience with the Portland compiler, but these are pretty basic optimizations for a compiler to do, so I'd be very surprised if it didn't handle situations like this well.

Upvotes: 3
