AdhityaRavi

Reputation: 79

How to efficiently calculate matrix inner product in Fortran?

I am trying to calculate something similar to a weighted matrix inner product in Fortran. The current script that I am using is as follows:

! --> In
integer, intent(in) :: ni, nj, nk, nVar
real(kind=8), intent(in), dimension(ni, nj, nk, nVar) :: U1, U2
real(kind=8), intent(in), dimension(ni, nj, nk) :: intW

! --> Out
real(kind=8), intent(out) :: innerProd

! --> Local
integer :: iVar

! --> Computing inner product
innerProd = 0.0d0
do iVar = 1, nVar
    innerProd = innerProd + sum(U1(:,:,:,iVar)*U2(:,:,:,iVar)*intW)
enddo

But I found that the above script is not very efficient. The same operation can be performed in Python using NumPy as follows:

import numpy as np 
import os

# --> Preventing numpy from multi-threading
os.environ['OPENBLAS_NUM_THREADS'] = '1'
os.environ['MKL_NUM_THREADS'] = '1'   

innerProd = 0

# --> Problem sizes (matching the timed case below)
ni = nj = nk = nVar = 130

# --> Toy matrices
U1 = np.random.random((ni,nj,nk,nVar))
U2 = np.random.random((ni,nj,nk,nVar))
intW = np.random.random((ni,nj,nk))

# --> Reshaping 
U1 = np.reshape(np.ravel(U1), (ni*nj*nk, nVar))
U2 = np.reshape(np.ravel(U2), (ni*nj*nk, nVar))
intW = np.reshape(np.ravel(intW), (ni*nj*nk))

# --> Calculating inner product
for iVar in range(nVar):
    innerProd = innerProd + np.dot(U1[:, iVar], U2[:, iVar]*intW)

The second method, using NumPy, seems to be far faster than the Fortran method. For the specific case of ni = nj = nk = nVar = 130, the times taken by the two methods are as follows:

 fortran_time = 25.8641 s
 numpy_time = 6.8924 s

I tried improving my Fortran code with ddot from BLAS as follows:

do iVar = 1, nVar
    do k = 1, nk
        do j = 1, nj
            innerProd = innerProd + ddot(ni, U1(:,j,k,iVar), 1, U2(:,j,k,iVar)*intW(:,j,k), 1)
        enddo
    enddo
enddo

But there was no considerable improvement in time. The time taken by the above method for the case of ni = nj = nk = nVar = 130 is ~24 s. (I should mention that I compiled the Fortran code with the '-O2' option to optimize performance.)

Unfortunately, there is no BLAS function for element-wise matrix multiplication. And I don't want to use reshape in Fortran because, unlike in Python, reshaping in Fortran copies the array into a new one, which leads to more RAM usage.

Is there any way to speed up the Fortran code so that it gets close to the performance of NumPy?

Upvotes: 2

Views: 1451

Answers (3)

Ed Smith

Reputation: 13196

I had the same observation comparing Numpy and Fortran code.

The difference turns out to be the version of BLAS: I found that using DGEMM from Netlib is similar to looping, and about three times slower than OpenBLAS (see the profiles in this answer).

The most surprising thing for me was that OpenBLAS provides code which is so much faster than just compiling a Fortran triple nested loop. It seems this is the whole point of GotoBLAS, which was handwritten in assembly code for the processor architecture.

Even timing the right thing, ordering loops correctly, avoiding copies, and using every optimising flag (in gfortran), the performance is still about three times slower than OpenBLAS. I've not tried ifort or pgi, but I wonder if this explains the upvoted comment by @kvantour, "loop finishes in 0.6s for me" (note that the intrinsic matmul is replaced by a BLAS call in some implementations).
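
To see the effect yourself, a minimal sketch (assuming gfortran and an installed OpenBLAS; bench_dgemm.f90 and the names in it are just illustrative) is to time the same DGEMM call while linking against different BLAS libraries:

! bench_dgemm.f90 -- time a square DGEMM with whatever BLAS is linked in.
! Reference BLAS:  gfortran -O2 bench_dgemm.f90 -lblas
! OpenBLAS:        gfortran -O2 bench_dgemm.f90 -lopenblas
! For a single-threaded comparison set OPENBLAS_NUM_THREADS=1, otherwise
! cpu_time sums the CPU time over all threads.
program bench_dgemm
    use iso_fortran_env, only: r8 => real64
    implicit none
    integer, parameter :: n = 2000
    real(r8), allocatable :: a(:,:), b(:,:), c(:,:)
    real(r8) :: t0, t1
    external :: dgemm

    allocate(a(n,n), b(n,n), c(n,n))
    call random_number(a)
    call random_number(b)
    c = 0.0_r8

    call cpu_time(t0)
    call dgemm('N', 'N', n, n, n, 1.0_r8, a, n, b, n, 0.0_r8, c, n)
    call cpu_time(t1)
    write(*,'("dgemm time (s):",es15.5)') t1 - t0
end program bench_dgemm

Relatedly, gfortran has a -fexternal-blas flag that makes large matmul calls dispatch to an external BLAS library, which is one way the intrinsic matmul can end up backed by OpenBLAS.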

Upvotes: 1

ptb

Reputation: 2148

You may not be timing what you think you are timing. Here's a complete Fortran example:

program test                                                        
    use iso_fortran_env, r8 => real64                               
    implicit none                                                   

    integer, parameter :: ni = 130, nj = 130, nk = 130, nvar = 130  
    real(r8), allocatable :: u1(:,:,:,:), u2(:,:,:,:), w(:,:,:)     
    real(r8) :: sum, t0, t1                                         
    integer :: i,j,k,n                                              

    call cpu_time(t0)                                               
    allocate(u1(ni,nj,nk,nvar))                                     
    allocate(u2(ni,nj,nk,nvar))                                     
    allocate(w(ni,nj,nk))                                           
    call cpu_time(t1)                                               
    write(*,'("allocation time(s):",es15.5)') t1-t0                 

    call cpu_time(t0)                                               
    call random_seed()                                              
    call random_number(u1)                                          
    call random_number(u2)                                          
    call random_number(w)                                           
    call cpu_time(t1)                                               
    write(*,'("random init time (s):",es15.5)') t1-t0               

    sum = 0.0_r8                                                    
    call cpu_time(t0)                                               
    do n = 1, nvar                                                  
        do k = 1, nk                                                
            do j = 1, nj                                            
                do i = 1, ni                                        
                    sum = sum + u1(i,j,k,n)*u2(i,j,k,n)*w(i,j,k)    
                end do                                              
            end do                                                  
        end do                                                      
    end do                                                          
    call cpu_time(t1)                                               
    write(*,'("Sum:",es15.5," time(s):",es15.5)') sum, t1-t0        

end program

And the output:

$ gfortran -O2 -o inner_product inner_product.f90            
$ time ./inner_product 
allocation time(s):    3.00000E-05
random init time (s):    5.73293E+00
Sum:    3.57050E+07 time(s):    5.69066E-01

real    0m6.465s
user    0m4.634s
sys 0m1.798s

Computing the inner product is less than 10% of the runtime in this Fortran code. How and what you are timing is very important. Are you sure you are timing the same things in the Fortran and Python versions? Are you sure you are only timing the inner-product calculation?
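
If the goal is a like-for-like comparison with the Python version, one option is to put a wall-clock timer around just the inner-product loop. A minimal sketch using system_clock, meant as a drop-in for the timed section of the program above (same declarations assumed; the two new declarations belong with the others at the top):

! Wall-clock timing of only the inner-product loop.
! cpu_time reports CPU time (summed over threads), while system_clock
! reports elapsed wall time, which is usually what a Python script measures.
integer(int64) :: count0, count1, count_rate
real(r8)       :: wall_time

call system_clock(count0, count_rate)
sum = 0.0_r8
do n = 1, nvar
    do k = 1, nk
        do j = 1, nj
            do i = 1, ni
                sum = sum + u1(i,j,k,n)*u2(i,j,k,n)*w(i,j,k)
            end do
        end do
    end do
end do
call system_clock(count1)
wall_time = real(count1 - count0, r8) / real(count_rate, r8)
write(*,'("inner product wall time (s):",es15.5)') wall_time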

Upvotes: 2

agentp

Reputation: 6989

This avoids making any copy. (Note that the BLAS ddot approach still needs to make a temporary copy for the element-wise product.)

subroutine dot3(n, a, b, c, result)
    implicit none
    real(kind=..) :: a(*), b(*), c(*), result
    integer :: i, n
    result = 0
    do i = 1, n
        result = result + a(i)*b(i)*c(i)
    enddo
end

dot3 is external, i.e. not inside a module or a contains construct. The kind should obviously match the declaration in the main code.

In the main code:

innerProd = 0
do iVar = 1, nVar
    call dot3(ni*nj*nk, U1(1,1,1,iVar), U2(1,1,1,iVar), intW, result)
    innerProd = innerProd + result
enddo
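
For reference, a minimal self-contained sketch of this approach (assuming real64 for the kind and the array sizes from the question; all names are just illustrative):

program inner_product_dot3
    use iso_fortran_env, only: r8 => real64
    implicit none
    integer, parameter :: ni = 130, nj = 130, nk = 130, nVar = 130
    real(r8), allocatable :: U1(:,:,:,:), U2(:,:,:,:), intW(:,:,:)
    real(r8) :: innerProd, result
    integer :: iVar

    allocate(U1(ni,nj,nk,nVar), U2(ni,nj,nk,nVar), intW(ni,nj,nk))
    call random_number(U1)
    call random_number(U2)
    call random_number(intW)

    innerProd = 0.0_r8
    do iVar = 1, nVar
        ! Sequence association: dot3 walks over ni*nj*nk contiguous elements
        ! starting at U1(1,1,1,iVar), so no temporary copy is made.
        call dot3(ni*nj*nk, U1(1,1,1,iVar), U2(1,1,1,iVar), intW, result)
        innerProd = innerProd + result
    enddo
    write(*,'("innerProd:",es15.5)') innerProd
end program inner_product_dot3

subroutine dot3(n, a, b, c, result)
    use iso_fortran_env, only: r8 => real64
    implicit none
    integer :: i, n
    real(r8) :: a(*), b(*), c(*), result
    result = 0.0_r8
    do i = 1, n
        result = result + a(i)*b(i)*c(i)
    enddo
end subroutine dot3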

Upvotes: 1
