Reputation: 413
If I use gfortran (Homebrew GCC 8.2.0) on my Mac to compile the simple program below without optimization (-O0), the call to matmul consistently executes in ~90 milliseconds. If I use any optimization (flags -O1, -O2 or -O3), the execution time increases to ~250 milliseconds. I've tried a wide range of sizes for inVect and matrix, but in all cases the -O0 option outperforms the other three optimization flags by at least a factor of 2.5. If I use smaller matrices with just a few hundred elements but loop over many calls to matmul, the performance hit is even worse, close to a factor of 10.
Is there a way I can avoid this behavior? I need to use optimization in some portions of my code but, at the same time, I also would like to perform the matrix multiplication as efficiently as possible.
I compile the file sandbox.f90 containing the code below with the command gfortran -ON sandbox.f90, where N is an optimization level 0-3 (no other compiler flags are used). The first value of outVect is printed solely to keep the gfortran optimizer from being clever and skipping the call to matmul altogether.
I'm a Fortran novice, so I apologize in advance if I am missing something obvious here.
program main
    implicit none
    real :: inVect(20000), matrix(20000,10000), outVect(10000)
    real :: start, finish

    ! Fill the inputs with random values
    call random_number(inVect)
    call random_number(matrix)

    ! Time only the call to matmul
    call cpu_time(start)
    outVect = matmul(inVect, matrix)
    call cpu_time(finish)

    ! Print outVect(1) so the optimizer cannot drop the matmul call
    print '("Time = ",f10.7," seconds. – First Value = ",f10.4)',finish-start,outVect(1)
end program main
Upvotes: 3
Views: 580
Reputation: 190
First, consider that I may be wrong. I just saw this problem for the first time, and I'm as surprised as you.
I looked into this problem and I understand it as follows. The optimization levels (-O1, -O2, -O3, -Ofast and so on) are tuned for the most common cases. However, in some cases an optimization backfires and the optimized build ends up slower than the -O0 build. This is because each level implicitly enables a set of flags that usually, but not always, lower the execution time of a specific task. In your case, every level other than -O0 turns on gfortran's front-end optimization, which, among other things, inlines calls to matmul(). Such a thing is generally good, but not necessarily for big arrays or for many calls to this function. Somehow, the cost of inlining matmul() outweighs the gain normally obtained from an inlined function (at least this is how I see it).
To avoid this behavior, I suggest adding the flag -finline-matmul-limit=0, which disables the inlining of matmul. Compiling with -O3 -finline-matmul-limit=0 leads to an execution time that is no worse than what is obtained with -O0.
You can also use -finline-matmul-limit=n, in which case matmul is inlined only when the arrays involved are no larger than n. I use n=0 for simplicity.
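If you want to check the effect on your own machine, something like the following sketch could help. It is not from the question: the program name, the file name bench.f90, the array sizes and the repetition count are all my own assumptions, chosen to mimic the "many calls on small matrices" case. I have not included timings because they will depend on the compiler version and hardware.

```fortran
! Hypothetical benchmark (names and sizes are assumptions, not from the question).
! Build it twice and compare the reported times:
!   gfortran -O3 bench.f90 -o bench_inline
!   gfortran -O3 -finline-matmul-limit=0 bench.f90 -o bench_noinline
program matmul_bench
    use iso_fortran_env, only: int64
    implicit none
    integer, parameter :: n = 20, m = 10      ! assumed small problem size
    integer, parameter :: reps = 100000       ! assumed number of matmul calls
    real :: inVect(n), matrix(n, m), outVect(m), total
    integer(int64) :: t0, t1, rate
    integer :: i

    call random_number(inVect)
    call random_number(matrix)
    total = 0.0

    ! system_clock measures wall-clock time: count first, then count rate
    call system_clock(t0, rate)
    do i = 1, reps
        inVect(1) = real(i)                   ! change the input on every iteration
        outVect = matmul(inVect, matrix)
        total = total + outVect(1)            ! use the result so it is not optimized away
    end do
    call system_clock(t1)

    print '("Time = ",f10.6," seconds. Checksum = ",es12.4)', &
        real(t1 - t0) / real(rate), total
end program matmul_bench
```

Running both executables a few times each should show whether disabling the inlining helps for your sizes. If it does, you can then try intermediate values of n instead of 0.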
I hope that this helps you.
Upvotes: 1