user334066

Reputation: 355

What GCC Optimization Is Occurring In This For Loop?

Using gcc 4.6 with -O3, I have timed the following four programs using the simple time command

#include <iostream>
int main(int argc, char* argv[])
{
  double val = 1.0; 
  unsigned int numIterations = 1e7; 
  for(unsigned int ii = 0;ii < numIterations;++ii) {
    val *= 0.999;
  }
  std::cout<<val<<std::endl;
}

Case 1 runs in 0.09 seconds

#include <iostream>
int main(int argc, char* argv[])
{
  double val = 1.0;
  unsigned int numIterations = 1e8;
  for(unsigned int ii = 0;ii < numIterations;++ii) {
    val *= 0.999;
  }
  std::cout<<val<<std::endl;
}

Case 2 runs in 17.6 seconds

int main(int argc, char* argv[])
{
  double val = 1.0;
  unsigned int numIterations = 1e8;
  for(unsigned int ii = 0;ii < numIterations;++ii) {
    val *= 0.999;
  }
}

Case 3 runs in 0.8 seconds

#include <iostream>
int main(int argc, char* argv[])
{
  double val = 1.0;
  unsigned int numIterations = 1e8;
  for(unsigned int ii = 0;ii < numIterations;++ii) {
    val *= 0.999999;
  }
  std::cout<<val<<std::endl;
}

Case 4 runs in 0.8 seconds

My question is: why is the second case so much slower than all the other cases? Case 3 shows that removing the cout brings the runtime back in line with what is expected, and Case 4 shows that changing the multiplier also greatly reduces the runtime. What optimization or optimizations are not occurring in Case 2, and why?

Update:

When I originally ran these tests there was no separate numIterations variable; the value was hard-coded in the for loop. In general, hard-coding this value made things run slower than in the cases given here. This is especially true for Case 3, which ran almost instantly with the numIterations variable as shown above, indicating James McNellis is correct about the entire loop being optimized out. I'm not sure why hard-coding 1e8 into the for loop prevents the removal of the loop in Case 3 or makes the other cases slower; however, the basic premise of Case 2 being significantly slower is even more true.

Diffing the assembly output for the cases above gives:

Case 2 and Case 1:
movl $100000000, 16(%esp)


movl $10000000, 16(%esp)

Case 2 and Case 4:
.long -652835029
.long 1072691150


.long -417264663
.long 1072693245

Upvotes: 7

Views: 630

Answers (4)

David Hammen

Reputation: 33106

René Richter was on the right track regarding underflow. The smallest positive normalized number is about 2.2e-308. With f(n)=0.999**n, this limit is reached after roughly 708,000 iterations. The remaining iterations are stuck with denormalized computations.

This explains why 100 million iterations take slightly more than 10 times the time needed to perform 10 million. The first 700,000 are done using the floating point hardware. Once you hit denormalized numbers, the floating point hardware punts; the multiplication is done in software.

Note that this would not be the case if the repeated computation properly calculated 0.999**N. Eventually the product would reach zero, and from that point on the multiplications would once again be done with the floating point hardware. That is not what happens because 0.999 * (smallest denormalized number) is the smallest denormalized number. The continued product eventually bottoms out.
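If you want to check the 708,000 figure and the bottoming-out behaviour for yourself, a quick sketch along these lines will do (assuming round-to-nearest and ordinary SSE double arithmetic, i.e. a typical x86-64 build):

#include <iostream>
#include <limits>

int main()
{
  // Count how many multiplications by 0.999 it takes before the running
  // product drops below the smallest normal double (about 2.2e-308).
  double val = 1.0;
  unsigned long n = 0;
  while (val >= std::numeric_limits<double>::min()) {
    val *= 0.999;
    ++n;
  }
  std::cout << "went denormal after " << n << " multiplications" << std::endl;

  // The "bottoming out" claim: multiplying the smallest denormal by 0.999
  // rounds right back to the smallest denormal, so the continued product
  // never reaches zero.
  const double dmin = std::numeric_limits<double>::denorm_min();
  std::cout << std::boolalpha << (dmin * 0.999 == dmin) << std::endl;
}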

What we can do is change the multiplier. A multiplier of 0.999999 will keep the continued product within the realm of normalized numbers for 708 million iterations. Taking advantage of this,

Case A  : #iterations = 1e7, multiplier=0.999, time=0.573692 sec
Case B  : #iterations = 1e8, multiplier=0.999, time=6.10548 sec
Case C  : #iterations = 1e7, multiplier=0.999999, time=0.018867 sec
Case D  : #iterations = 1e8, multiplier=0.999999, time=0.189375 sec

Here you can easily see how much faster the floating-point hardware is than the software emulation.
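Incidentally, on SSE hardware you can dodge the software path entirely by putting the processor into flush-to-zero mode, at the cost of the tail of the product becoming exactly zero. A rough sketch (x86/x86-64 with SSE arithmetic assumed; gcc's -ffast-math arranges much the same thing at program startup):

#include <iostream>
#include <xmmintrin.h>  // _MM_SET_FLUSH_ZERO_MODE (SSE intrinsics header)

int main()
{
  // Flush-to-zero: any result that would underflow to a denormal is
  // replaced by 0, so the loop never has to handle denormal operands.
  _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);

  double val = 1.0;
  const unsigned int numIterations = 100000000;  // 1e8, as in case 2
  for (unsigned int ii = 0; ii < numIterations; ++ii) {
    val *= 0.999;
  }
  std::cout << val << std::endl;  // prints 0 once the product has underflowed
}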

Upvotes: 8

René Richter

Reputation: 3909

Compile with option -S and look at the generated assembler output (files named *.s).

Edit:

In Case 3 the loop is removed since the result is not used.

For case 1, 2 and 4 let's do some math: The base 10 logarithm of the result of case 1 is 1e7 * log10(0.999) = -4345 (roughly). For case 2 we get 1e8 * log10(0.999) = -43451. For case 4 it is 1e8 * log10(0.999999) = -43 (roughly). The result itself is pow(10, logarithm).

The floating point unit uses 80-bit long doubles internally on x86/x64 CPUs. When the number becomes smaller than about 1.9E-4951 we get a floating point underflow, as @James Kanze pointed out. This happens only in case 2. I don't know why this takes longer than a normalized result; maybe someone else can explain.
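As a sanity check on those magnitudes, here is a quick sketch that redoes the arithmetic and prints the relevant underflow thresholds next to it (the long double figure assumes x86's 80-bit extended format):

#include <cmath>
#include <iostream>
#include <limits>

int main()
{
  // Predicted base-10 exponent of each final product: n * log10(multiplier).
  std::cout << "case 1: " << 1e7 * std::log10(0.999)    << std::endl   // roughly -4345
            << "case 2: " << 1e8 * std::log10(0.999)    << std::endl   // roughly -43451
            << "case 4: " << 1e8 * std::log10(0.999999) << std::endl;  // roughly -43

  // Underflow thresholds for comparison.
  std::cout << "smallest normal double:        "
            << std::numeric_limits<double>::min() << std::endl         // ~2.2e-308
            << "smallest denormal double:      "
            << std::numeric_limits<double>::denorm_min() << std::endl  // ~4.9e-324
            << "smallest denormal long double: "
            << std::numeric_limits<long double>::denorm_min() << std::endl;  // ~3.6e-4951 on x86
}

Only case 2 ends up below even the extended-precision range, which matches it being the only run that underflows completely.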

Upvotes: 8

James Kanze

Reputation: 153899

For what it's worth, I've done a few tests on all four versions, with the following results:

  • First, I get the same discrepancies in speed as you do, more or less: I have a slower machine, and I'm seeing distinct differences between tests 1, 3 and 4, but they remain more than two orders of magnitude faster than test 2.

  • Test 3 was by far the fastest: looking at the generated code shows that g++ did remove the loop, since the results weren't being used.

  • Test 1 took a little more than a tenth of the time of test 4. About what one would expect, more or less.

  • And the only difference in the generated code between tests 1, 2 and 4 is the constant. There is absolutely no difference in the loop.

So the difference isn't a question of compiler optimizations. I can only speculate that it somehow has to do with underflow handling. (But then, I would expect a slowdown in loop 1 as well.)

Upvotes: 1

David Grayson

Reputation: 87386

I'm running g++ (Ubuntu 4.4.3-4ubuntu5) 4.4.3 on 64-bit Linux. I used -O3 just like you did.

Here are my timing results when I do the tests:

  • Case 2: 6.81 s
  • Case 3: 0.00 s
  • Case 4: 0.20 s

I looked at the assembly of all three tests.

Case 3 is fast because GCC does indeed optimize away the entire for loop. The main function is simply (after removing labels and .cfi* statements):

main:
        xorl    %eax, %eax
        ret

The only difference in the assembly for Case 2 and Case 4 is the constant that presumably represents 0.999 or 0.999999:

$ diff case2.s case4.s
121,122c121,122
<   .long   3642132267
<   .long   1072691150
---
>   .long   3877702633
>   .long   1072693245

That is the only difference in the assembly. GCC performed the same set of optimizations for Case 2 and Case 4 but Case 2 takes 30 times as long as Case 4. My conclusion from this is that floating point multiplication on the x64 architecture must take a variable amount of time depending on what values you are multiplying. That should not be a very surprising statement, but it would be nice if someone who knows more about x64 could explain to us why that is true.
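To confirm which double each pair of .long values encodes, you can stitch the two 32-bit halves back together and reinterpret the bit pattern. A small sketch, assuming the usual layout where the first .long is the low-order half:

#include <stdint.h>  // uint32_t, uint64_t
#include <cstring>
#include <iostream>

// Reinterpret a (low, high) pair of 32-bit words as an IEEE-754 double.
double from_longs(uint32_t low, uint32_t high)
{
  uint64_t bits = (static_cast<uint64_t>(high) << 32) | low;
  double d;
  std::memcpy(&d, &bits, sizeof d);  // type-pun without aliasing problems
  return d;
}

int main()
{
  std::cout.precision(17);
  std::cout << from_longs(3642132267u, 1072691150u) << std::endl   // case 2: the double nearest 0.999
            << from_longs(3877702633u, 1072693245u) << std::endl;  // case 4: the double nearest 0.999999
}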

I did not carefully examine case 1 because you are only doing 1e7 iterations instead of 1e8, so of course it should take less time.

Upvotes: 1
