bernd czech

Reputation: 33

Intrinsic dot_product slower than a*a+b*b+c*c?

Recently I tested the runtime difference between explicit summation and the intrinsic function for calculating a dot product. Surprisingly, the naïve explicit version was faster.

  program test

  implicit none
  real*8 , dimension(3) :: idmat
  real*8 :: dummy(3)
  integer :: i
  integer*8 :: j   ! 10**10 overflows a default (4-byte) integer

  idmat=0
  dummy=0

  do i=1,3

      idmat(i)=1

  enddo

  do j=1,10_8**10

  !   dummy(mod(j,3)+1)=dot_product(idmat,idmat)
      dummy(mod(j,3)+1)=idmat(1)*idmat(1)+idmat(2)*idmat(2)+idmat(3)*idmat(3)

  enddo

  print*, dummy

  end program test

Here is what confuses me:

1. No -O3 Optimization

If I use: gfortran test.f90 -o test ; time ./test

I find a runtime of 6.297 s using the intrinsic dot_product (commented out above) and 4.486 s using the manual explicit expression. How does that make sense?

2. Including -O3 Optimization

If I use: gfortran test.f90 -O3 -o test ; time ./test

I find runtimes of 1.808 s and 1.803 s respectively, so both are actually the same speed.

3. What I actually expect

...is the intrinsic function to be faster, as it could:

  1. compute the 3 products in parallel
  2. add the 3 products

where the explicit form has to sequentially:

  1. compute product 1
  2. compute product 2
  3. compute product 3
  4. add the 3 products

Do I have to create a new parallel dot_product function to be faster? Or is there an additional option for the gfortran compiler which I don't know?

Please note: I read across the internet about SIMD, auto-vectorization and parallelisation in modern Fortran. Although I learned something my question wasn't answered anywhere.

Upvotes: 3

Views: 900

Answers (1)

There is no point in drawing conclusions from the non-optimized numbers; without optimization the generated code is not representative. The optimized numbers are the same, so everything is fine.

"...is the intrinsic function to be faster, as it could: compute the 3 products in parallel"

Nothing will be done in parallel unless you enable specific parallel optimizations. Those optimizations are as easy for the compiler to apply to the explicit expression as to the intrinsic, and often much easier for an explicit loop.

Well, at least for the normal sense of parallel using threads or similar. What can be done in parallel is to use the vector instructions and to schedule the instructions to overlap in the CPU pipeline. That can be done by the optimizing compiler and is likely done for both versions when you use -O3. You should not expect this to happen when no optimizations are enabled.

The use of "parallel" (SIMD) instructions can sometimes be improved by compiler directives like !$omp simd or, with Intel compilers, !DIR$ VECTOR.
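As a sketch (untested here; the array name a and the length n are placeholders I chose, not from the question), the OpenMP SIMD directive is used like this. Compile with gfortran -fopenmp-simd (or -fopenmp):

```fortran
program simd_dot
  implicit none
  integer, parameter :: n = 1024
  real*8 :: a(n), s
  integer :: i

  a = 1d0
  s = 0d0

  ! Hint to the compiler that this reduction loop should be vectorized
  ! with SIMD instructions.
  !$omp simd reduction(+:s)
  do i = 1, n
     s = s + a(i)*a(i)
  end do

  print *, s
end program simd_dot
```

With -O3 gfortran will often vectorize such a loop anyway; the directive mainly helps when the compiler's cost model declines to.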

"Do I have to create a new parallel dot_product function to be faster?"

Yes, normally you do, for example using OpenMP. Or you could:
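As an illustration (a sketch with arrays a, b and a length n of my own choosing; for a length-3 vector like yours the thread overhead would dwarf the work), an OpenMP reduction looks like this. Compile with gfortran -O3 -fopenmp:

```fortran
program omp_dot
  implicit none
  integer, parameter :: n = 10000000
  real*8, allocatable :: a(:), b(:)
  real*8 :: s
  integer :: i

  allocate(a(n), b(n))
  a = 1d0
  b = 2d0
  s = 0d0

  ! Each thread accumulates a private partial sum; OpenMP combines
  ! the partial sums into s at the end of the loop.
  !$omp parallel do reduction(+:s)
  do i = 1, n
     s = s + a(i)*b(i)
  end do
  !$omp end parallel do

  print *, s
end program omp_dot
```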

"Or is there an additional option for the gfortran compiler which I don't know?"

Yes, automatic parallelization (https://gcc.gnu.org/wiki/AutoParInGCC), for example -floop-parallelize-all -ftree-parallelize-loops=4.

Note that it will not compute the individual multiplications in parallel; it will parallelize the outer j loop.

Upvotes: 1
