Why is matmul slower with gfortran compiler optimization turned on?

Question

If I use gfortran (Homebrew GCC 8.2.0) on my Mac to compile the simple program below without optimization (-O0) the call to matmul consistently executes in ~90 milliseconds. If I use any optimization (flags -O1, -O2 or -O3) the execution time increases to ~250 milliseconds. I've tried using a wide range of different sizes for inVect and matrix but in all cases the -O0 option outperforms the other three optimization flags by at least a factor of 2.5. If I use smaller matrices with just a few hundred elements but loop over many calls to matmul the performance hit is even worse, close to a factor of 10.

Is there a way I can avoid this behavior? I need to use optimization in some portions of my code but, at the same time, I also would like to perform the matrix multiplication as efficiently as possible.

I compile the file sandbox.f90 containing the code below with the command gfortran -ON sandbox.f90, where N is an optimization level 0-3 (no other compiler flags are used). The first value of outVect is printed solely to keep the gfortran optimization from being clever and skipping the call to matmul altogether.

I'm a Fortran novice, so I apologize in advance if I am missing something obvious here.

program main
    implicit none
    real :: inVect(20000), matrix(20000,10000), outVect(10000)
    real :: start, finish

    call random_number(inVect)
    call random_number(matrix)
        
    call cpu_time(start)
    outVect = matmul(inVect, matrix)
    call cpu_time(finish)

    print '("Time = ",f10.7," seconds. – First Value = ",f10.4)',finish-start,outVect(1)
end program main

Noureddine · Accepted Answer

First, consider that I may be wrong. I just saw this problem for the first time, and I'm as surprised as you.

I looked into this problem, and I understand it as follows. The optimization levels (-O0, -O3, -Ofast, and so on) are tuned for the most general (frequent) cases. In some cases, however, -O3 ends up less efficient than a lower optimization level. This is because these levels implicitly enable flags that are meant to reduce execution time for specific tasks. In your case, -O3 implies, among other things, that calls to matmul() are inlined. That is generally a good thing, but not necessarily for big arrays or for many calls to the function. Somehow, the cost of inlining matmul() outweighs the gain from the inlined code (at least this is how I see it).

To avoid this behavior, I suggest compiling with -O3 -finline-matmul-limit=0, which disables the inlining of matmul. With these flags, the execution time is no worse than what is obtained with -O0.

More generally, with -finline-matmul-limit=n the matmul function is inlined only if the arrays involved are smaller than n. I use n=0 for simplicity.
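If you want to check the effect of the flag on the "many calls with small arrays" case from the question, a minimal sketch could look like the one below. The file name bench.f90, the array sizes, and the number of calls are assumptions chosen for illustration, not values from the question, and the timings will depend on your machine and GCC version.

```fortran
! Hypothetical benchmark (bench.f90): times many matmul calls on small arrays.
! Compile it twice and compare the reported times, for example:
!   gfortran -O3 bench.f90
!   gfortran -O3 -finline-matmul-limit=0 bench.f90
program bench
    use, intrinsic :: iso_fortran_env, only: int64
    implicit none
    integer, parameter :: n = 20, m = 10       ! assumed sizes (a few hundred elements)
    integer, parameter :: ncalls = 1000000     ! assumed number of repetitions
    real :: inVect(n), matrix(n, m), outVect(m), total
    integer(int64) :: t0, t1, rate
    integer :: i

    call random_number(inVect)
    call random_number(matrix)
    total = 0.0

    call system_clock(t0, rate)                ! rate = clock ticks per second
    do i = 1, ncalls
        inVect(1) = real(i) / real(ncalls)     ! change the input on every iteration
        outVect = matmul(inVect, matrix)
        total = total + outVect(1)             ! use the result so the call is kept
    end do
    call system_clock(t1)

    print '("Time = ",f10.4," seconds. Checksum = ",es12.4)', &
        real(t1 - t0) / real(rate), total
end program bench
```

The loop uses system_clock rather than cpu_time only so that wall-clock time is measured; either should be fine for a single-threaded comparison like this one.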

I hope this helps.
