Reputation: 33
Recently I compared the runtime of an explicit summation against the intrinsic function dot_product for calculating a dot product. Surprisingly, the naive explicit version was faster.
program test
  implicit none
  real*8 , dimension(3) :: idmat
  real*8 :: dummy(3)
  integer :: i
  integer*8 :: j   ! 10**10 does not fit in a default 4-byte integer
  idmat=0
  dummy=0
  do i=1,3
    idmat(i)=1
  enddo
  do j=1,10_8**10
    ! dummy(mod(j,3_8)+1)=dot_product(idmat,idmat)
    dummy(mod(j,3_8)+1)=idmat(1)*idmat(1)+idmat(2)*idmat(2)+idmat(3)*idmat(3)
  enddo
  print*, dummy
end program test
If I use: gfortran test.f90 -o test ; time ./test
I find a runtime of 6.297 s using the function dot_product (commented out above) and 4.486 s using the explicit sum.
How does that make sense?
If I use: gfortran test.f90 -O3 -o test ; time ./test
I find runtimes of 1.808 s and 1.803 s respectively, so both are effectively the same speed.
...is the intrinsic function to be faster, as it could compute the 3 products in parallel, where the explicit form has to compute them sequentially.
Do I have to create a new parallel dot_product function to be faster? Or is there an additional option for the gfortran compiler which I don't know?
Please note: I read across the internet about SIMD, auto-vectorization and parallelisation in modern Fortran. Although I learned something, my question wasn't answered anywhere.
Upvotes: 3
Views: 900
Reputation: 59998
It makes no sense to even look at the non-optimized numbers. The optimized numbers are the same, so everything is fine.
"...is the intrinsic function to be faster, as it could: compute the 3 products in parallel"
There will be nothing done in parallel unless you enable specific parallel optimizations. These optimizations will be as easy to do for the loop as for the intrinsic and often even much easier for the loop.
Well, at least for the normal sense of parallel using threads or similar. What can be done in parallel is to use the vector instructions and to schedule the instructions to overlap in the CPU pipeline. That can be done by the optimizing compiler and is likely done for both versions when you use -O3. You should not expect this to happen when no optimizations are enabled.
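If you want to compare the two variants under identical conditions, one option is to time both loops inside a single executable. The following is only a sketch: the program name time_dot, the reduced iteration count n and the use of system_clock are my assumptions, not part of the question's code.

```fortran
program time_dot
  use, intrinsic :: iso_fortran_env, only: real64, int64
  implicit none
  ! Iteration count is an assumption, chosen much smaller than in the question.
  integer(int64), parameter :: n = 10000000_int64
  real(real64) :: a(3), r_intrinsic(3), r_explicit(3)
  integer(int64) :: j, t0, t1, t2, rate

  a = 1.0_real64
  r_intrinsic = 0.0_real64
  r_explicit = 0.0_real64

  ! Time the intrinsic version.
  call system_clock(t0, rate)
  do j = 1, n
    r_intrinsic(mod(j, 3_int64) + 1) = dot_product(a, a)
  end do
  call system_clock(t1)

  ! Time the explicit sum.
  do j = 1, n
    r_explicit(mod(j, 3_int64) + 1) = a(1)*a(1) + a(2)*a(2) + a(3)*a(3)
  end do
  call system_clock(t2)

  ! Both variants must produce identical values.
  if (any(r_intrinsic /= r_explicit)) error stop "results differ"

  print *, "dot_product:", real(t1 - t0, real64) / real(rate, real64), "s"
  print *, "explicit:   ", real(t2 - t1, real64) / real(rate, real64), "s"
  print *, r_intrinsic
end program time_dot
```

Build it once with gfortran -O0 and once with gfortran -O3 and compare. Keep in mind that at -O3 the compiler may hoist the loop-invariant dot product out of the loop, so the optimized timings may say little about the dot product itself.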
The use of the "parallel" instructions (SIMD) can sometimes be improved by using compiler directives like !$omp simd or !DEC$ VECTOR.
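As a rough illustration of the OpenMP SIMD directive, here is a minimal sketch of a dot product written as a reduction loop. The program name simd_dot, the vector length n and the test values are assumptions for the example only.

```fortran
program simd_dot
  use, intrinsic :: iso_fortran_env, only: real64
  implicit none
  integer, parameter :: n = 1024   ! vector length (assumption)
  real(real64) :: x(n), y(n), s
  integer :: i

  x = 1.0_real64
  y = 2.0_real64
  s = 0.0_real64

  ! Ask the compiler to vectorize the loop; s is a sum reduction variable.
  !$omp simd reduction(+:s)
  do i = 1, n
    s = s + x(i) * y(i)
  end do

  ! The values are chosen so that the sum is exact in any summation order.
  if (s /= dot_product(x, y)) error stop "mismatch with intrinsic"
  print *, s
end program simd_dot
```

With gfortran the directive is only honoured when compiled with -fopenmp or -fopenmp-simd; otherwise it is treated as a plain comment. Whether it actually helps over what -O3 already does depends on the compiler and the loop.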
"Do I have to create a new parallel dot_product function to be faster?"
Yes, normally you do. For example using OpenMP. Or you could:
"Or is there an additional option for the gfortran compiler which i don't know?"
Yes, automatic parallelization (https://gcc.gnu.org/wiki/AutoParInGCC), for example with -floop-parallelize-all -ftree-parallelize-loops=4.
Note that it will not perform those individual multiplications in parallel; it will parallelize the outer do loop (the one over j in your program).
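For the OpenMP route mentioned above, a threaded dot product could look roughly like the sketch below. The module par_dot_mod and the function par_dot are hypothetical names for illustration, and the vector length n is an assumption.

```fortran
module par_dot_mod
  use, intrinsic :: iso_fortran_env, only: real64
  implicit none
contains
  ! Hypothetical helper: dot product with the loop split across threads.
  function par_dot(x, y) result(s)
    real(real64), intent(in) :: x(:), y(:)
    real(real64) :: s
    integer :: i
    s = 0.0_real64
    ! Each thread accumulates a private partial sum, combined at the end.
    !$omp parallel do reduction(+:s)
    do i = 1, size(x)
      s = s + x(i) * y(i)
    end do
    !$omp end parallel do
  end function par_dot
end module par_dot_mod

program test_par_dot
  use, intrinsic :: iso_fortran_env, only: real64
  use par_dot_mod, only: par_dot
  implicit none
  integer, parameter :: n = 1000000   ! vector length (assumption)
  real(real64), allocatable :: x(:), y(:)

  allocate(x(n), y(n))
  x = 1.0_real64
  y = 2.0_real64

  ! The values are chosen so that the sum is exact in any summation order.
  if (par_dot(x, y) /= dot_product(x, y)) error stop "mismatch with intrinsic"
  print *, par_dot(x, y)
end program test_par_dot
```

Compile with gfortran -O3 -fopenmp. This only pays off for long vectors; for a 3-element vector like the one in the question, the cost of starting and synchronizing threads will far exceed the three multiplications.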
Upvotes: 1