smilingbuddha

Reputation: 14754

Reason for different speeds of floating point division with and without the optimization flag -O3 (C++/C)

I was trying to measure the speed difference between single precision division and double precision division in C++.

Here is the simple code that I wrote.

#include <iostream>
#include <time.h>

int main(int argc, char *argv[])
{

  float     f_x = 45672.0;
  float     f_y = 67783.0;
  double    d_x = 45672.0;
  double    d_y = 67783.0;

  float     f_answer;
  double    d_answer;

  clock_t   start,stop;
  int       N = 200000000; //2*10^8


  start = clock();
  for (int i = 0; i < N; ++i)
  {
    f_answer = f_x/f_y;
  }
  stop = clock();
  std::cout<<"Single Precision:"<< (stop-start)/(double)CLOCKS_PER_SEC<<"    "<<f_answer <<std::endl;


  start = clock();
  for (int i = 0; i < N; ++i)
  {
    d_answer = d_x/d_y;
  }
  stop = clock();
  std::cout<<"Double precision:" <<(stop-start)/(double)CLOCKS_PER_SEC<<"   "<< d_answer<<std::endl;

  return 0;
}

When I compiled the code without optimization, as g++ test.cpp, I got the following output:

Desktop: ./a.out
Single precision:8.06    0.673797
Double precision:12.68   0.673797

But if I compile it with g++ -O3 test.cpp, then I get:

Desktop: ./a.out
Single precision:0    0.673797
Double precision:0   0.673797

How did I get such a drastic performance increase? I assume the time shown in the second case is 0 because of the low resolution of the clock() function. Did the compiler somehow detect that each for loop iteration is independent of the previous iterations?

Upvotes: 2

Views: 359

Answers (3)

Omnifarious

Reputation: 56088

Looking at the assembly that you get from g++ -O3 -S, it's quite apparent the loops and all of your floating point calculations (aside from those involving the time) were optimized out of existence:

        .section        .text.startup,"ax",@progbits
        .p2align 4,,15
        .globl  main
        .type   main, @function
main:
.LFB970:
        .cfi_startproc
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset 6, -16
        pushq   %rbx
        .cfi_def_cfa_offset 24
        .cfi_offset 3, -24
        subq    $24, %rsp
        .cfi_def_cfa_offset 48
        call    clock
        movq    %rax, %rbx
        call    clock
        movq    %rax, %rbp
        movl    $.LC0, %esi
        movl    std::cout, %edi
        subq    %rbx, %rbp
        call    std::basic_ostream<char, std::char_traits<char> >& std::operator<< <std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*)

See the two calls to clock, one right after the other? And before those, only some stack maintenance instructions. Yep, those loops are completely gone.

You only use f_answer or d_answer to print out an answer that can be trivially calculated at compile time, and the compiler can see that. There's no point in even having them. And if there's no point in having them, there's no point in having f_x, f_y, d_x, or d_y either. All gone.

To solve this, you need to have each iteration of the loop depend on the results from the last iteration. Here is my solution to this problem. I use the complex template to do some of the calculations involved in computing the Mandelbrot set:

#include <iostream>
#include <time.h>
#include <complex>

int main(int argc, char *argv[])
{
   using ::std::complex;
   using ::std::cout;

   const complex<float> f_coord(0.1, 0.1);
   const complex<double> d_coord(0.1, 0.1);

   complex<float> f_answer(0, 0);
   complex<double> d_answer(0, 0);

   clock_t   start, stop;
   const unsigned int N = 200000000; //2*10^8

   start = clock();
   for (unsigned int i = 0; i < N; ++i)
   {
      f_answer = (f_answer * f_answer) + f_coord;
   }
   stop = clock();
   cout << "Single Precision: " << (stop-start)/(double)CLOCKS_PER_SEC
        << "    " << f_answer << '\n';


   start = clock();
   for (unsigned int i = 0; i < N; ++i)
   {
      d_answer = (d_answer * d_answer) + d_coord;
   }
   stop = clock();
   cout << "Double precision: " <<(stop-start)/(double)CLOCKS_PER_SEC
        << "   " << d_answer << '\n';

   return 0;
}

Upvotes: 5

Alexey Frunze

Reputation: 62106

If you add the volatile qualifier in the definitions of your floats and doubles, the compiler won't optimize away the unused calculations.

Upvotes: 1

Oliver Charlesworth

Reputation: 272752

Probably because the compiler optimised the loop away to a single iteration. It may even have done the division at compile-time.

Check the assembler of your executable to be sure (use e.g. objdump).

Upvotes: 7
