Luke Davis
Luke Davis

Reputation: 2666

Why is the Fortran intrinsic function "spread" often slower than explicit iteration

I work with geophysical models and a common situation is needing to multiply, add, etc. 2D data with 3D data. Below is an example.

module benchmarks
  implicit none
  integer, parameter :: n=500
  integer :: k
  real :: d2(n,n)
  real :: d3(n,n,n)
  contains
  ! Iteration
  subroutine benchmark_a(res)
    real, intent(out) :: res(n,n,n)
    do k = 1, size(d3,3)
      res(:,:,k) = d2*d3(:,:,k)
    end do
  end subroutine
  ! Spread
  subroutine benchmark_b(res)
    real, intent(out) :: res(n,n,n)
    res = d3*spread(d2, 3, size(d3,3))
  end subroutine
end module

program main
  use benchmarks
  real :: t, tarray(2)
  real :: res(n,n,n)
  call random_number(d2)
  call random_number(d3)
  ! Iteration
  call dtime(tarray, t)
  call benchmark_a(res)
  call dtime(tarray, t)
  write(*,*) 'Iteration', t
  ! Spread
  call dtime(tarray, t)
  call benchmark_b(res)
  call dtime(tarray, t)
  write(*,*) 'Spread', t
end program

When I run this with varying dimension size n, I generally find spread is much much slower; for example:

Spread   2.09942889
Iteration  0.458283991

Does anyone know why the spread approach rather than an explicit for loop (which I thought were, generally, to be avoided at all costs) is so much slower?

Upvotes: 2

Views: 584

Answers (1)

Steve Lionel
Steve Lionel

Reputation: 7267

The basic answer here is "it isn't". Maybe with a specific compiler and specific circumstances, the intrinsic isn't as well-optimized as an explicit DO loop, but it doesn't have to be that way. I tested with ifort 19, and even at default optimization levels, the SPREAD intrinsic and the explicit loop generated similar code, with the intrinsic being faster when I correct the program to use the result.

Iteration 0.2187500 0.1376885 Spread 9.3750000E-02 0.1376885

I would also caution (as I did in the comments to your question) that simplistic benchmark programs often don't measure what the author thinks they do. The most common error, which your original and revised examples both exhibit, is that the result of the work-under-test is never used so a sufficiently clever compiler can simply evaporate the entire operation. Indeed when I build both your test cases with ifort 19, the compiler completely removes all the work leaving only the timing code. Needless to say, that runs pretty fast.

  implicit none
  integer, parameter :: n=500
  integer :: k
  real :: d2(n,n)
  real :: d3(n,n,n)
  contains
  ! Iteration
  subroutine benchmark_a(res)
    real, intent(out) :: res(n,n,n)
    do k = 1, size(d3,3)
      res(:,:,k) = d2*d3(:,:,k)
    end do
  end subroutine
  ! Spread
  subroutine benchmark_b(res)
    real, intent(out) :: res(n,n,n)
    res = d3*spread(d2, 3, size(d3,3))
  end subroutine
end module

program main
  use benchmarks
  real :: tstart,tend
  real :: res(n,n,n)
  call random_number(d2)
  call random_number(d3)
  ! Iteration
  call cpu_time(tstart)
  call benchmark_a(res)
  call cpu_time(tend)
  write(*,*) 'Iteration', tend-tstart, res(10,10,10)
  ! Spread
  call cpu_time(tstart)
  call benchmark_b(res)
  call cpu_time(tend)
  write(*,*) 'Spread', tend-tstart, res(10,10,10)
end program```

Upvotes: 5

Related Questions