F.N.B

Reputation: 1619

OpenBLAS slower than intrinsic function dot_product

I need to compute a dot product in Fortran. I can use either the intrinsic function dot_product or ddot from OpenBLAS. The problem is that ddot is slower. This is my code:

With BLAS:

program VectorBLAS
! time VectorBlas.e = 0.30s
implicit none
double precision, dimension(3)  :: b
double precision                :: result
double precision, external      :: ddot
integer, parameter              :: LargeInt_K = selected_int_kind (18)
integer (kind=LargeInt_K)        :: I

DO I = 1, 10000000
   b(:) = 3
   result = ddot(3, b, 1, b, 1)
END DO
end program VectorBLAS

With dot_product:

program VectorModule
! time VectorModule.e = 0.19s
implicit none
double precision, dimension (3)  :: b
double precision                 :: result
integer, parameter              :: LargeInt_K = selected_int_kind (18)
integer (kind=LargeInt_K)        :: I

DO I = 1, 10000000
  b(:) = 3
  result = dot_product(b, b)
END DO
end program VectorModule

The two codes are compiled using:

gfortran file_name.f90 -lblas -o file_name.e

What am I doing wrong? Shouldn't BLAS be faster?

Upvotes: 3

Views: 1221

Answers (1)

Alexander Vogt

Reputation: 18098

While BLAS, and especially the optimized versions, are generally faster for larger arrays, the built-in functions are faster for smaller sizes.

This is visible in the linked source code of ddot, where extra work goes into additional functionality (e.g., support for non-unit increments). For small array lengths, that overhead outweighs the performance gain from the optimizations.
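To make that per-call overhead concrete, here is a minimal sketch of the kind of bookkeeping a BLAS-style ddot carries. The function sketch_ddot is a hypothetical, simplified stand-in loosely modeled on the reference BLAS layout (argument check, then a branch on the increments); it is not the OpenBLAS code, which dispatches to hand-tuned kernels instead.

```fortran
program ddot_overhead_sketch
  use, intrinsic :: iso_fortran_env, only: real64
  implicit none
  real(real64) :: b(3)

  b = 3.0_real64
  ! Both calls compute 3*3 + 3*3 + 3*3 = 27
  print *, 'sketch_ddot: ', sketch_ddot(3, b, 1, b, 1)
  print *, 'dot_product: ', dot_product(b, b)

contains

  ! Hypothetical, simplified stand-in for a BLAS-style ddot (assumption,
  ! not the actual library source).
  function sketch_ddot(n, x, incx, y, incy) result(s)
    integer, intent(in)      :: n, incx, incy
    real(real64), intent(in) :: x(*), y(*)
    real(real64)             :: s
    integer                  :: i, ix, iy

    s = 0.0_real64
    if (n <= 0) return                      ! argument check on every call

    if (incx == 1 .and. incy == 1) then     ! branch: contiguous fast path
      do i = 1, n
        s = s + x(i)*y(i)
      end do
    else                                    ! branch: general strides
      ix = 1
      iy = 1
      if (incx < 0) ix = (1 - n)*incx + 1   ! negative stride starts at the far end
      if (incy < 0) iy = (1 - n)*incy + 1
      do i = 1, n
        s = s + x(ix)*y(iy)
        ix = ix + incx
        iy = iy + incy
      end do
    end if
  end function sketch_ddot

end program ddot_overhead_sketch
```

With n = 3, the external call, the checks, and the branch cost about as much as the three multiply-adds themselves, whereas the compiler can typically inline dot_product for a fixed-size array.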

If you make your vectors (much) larger, the optimized version should be faster.

Here is an example to illustrate this:

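program test
  use, intrinsic :: ISO_Fortran_env, only: REAL64
  implicit none
  integer                   :: t1, t2, rate, ttot1, ttot2, i
  real(REAL64), allocatable :: a(:),b(:),c(:)
  real(REAL64), external    :: ddot

  allocate( a(100000), b(100000), c(100000) )
  ! Clock ticks per second, used to convert tick counts to seconds
  call system_clock(count_rate=rate)

  ! Accumulated tick counts for dot_product (ttot1) and ddot (ttot2)
  ttot1 = 0 ; ttot2 = 0
  do i=1,1000
    ! Fresh random input for each repetition
    call random_number(a)
    call random_number(b)

    ! Time the intrinsic. Note: c is an array, so the scalar result is
    ! broadcast to all its elements; that cost is part of both timings.
    call system_clock(t1)
    c = dot_product(a,b)
    call system_clock(t2)
    ttot1 = ttot1 + t2 - t1

    ! Time the BLAS routine on the same data
    call system_clock(t1)
    c = ddot(100000,a,1,b,1)
    call system_clock(t2)
    ttot2 = ttot2 + t2 - t1
  enddo
  ! Total elapsed time in seconds for each variant
  print *,'dot_product: ', real(ttot1)/real(rate)
  print *,'BLAS, ddot:  ', real(ttot2)/real(rate)
end program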

The BLAS routines are quite a bit faster here:

OMP_NUM_THREADS=1 ./a.out 
 dot_product:   0.145999998    
 BLAS, ddot:    0.100000001  

Upvotes: 4
