Reputation: 2666
I work with geophysical models and a common situation is needing to multiply, add, etc. 2D data with 3D data. Below is an example.
module benchmarks
implicit none
integer, parameter :: n=500
integer :: k
real :: d2(n,n)
real :: d3(n,n,n)
contains
! Iteration
subroutine benchmark_a(res)
real, intent(out) :: res(n,n,n)
do k = 1, size(d3,3)
res(:,:,k) = d2*d3(:,:,k)
end do
end subroutine
! Spread
subroutine benchmark_b(res)
real, intent(out) :: res(n,n,n)
res = d3*spread(d2, 3, size(d3,3))
end subroutine
end module
program main
use benchmarks
real :: t, tarray(2)
real :: res(n,n,n)
call random_number(d2)
call random_number(d3)
! Iteration
call dtime(tarray, t)
call benchmark_a(res)
call dtime(tarray, t)
write(*,*) 'Iteration', t
! Spread
call dtime(tarray, t)
call benchmark_b(res)
call dtime(tarray, t)
write(*,*) 'Spread', t
end program
When I run this with varying dimension size n
, I generally find spread
is much much slower; for example:
Spread 2.09942889
Iteration 0.458283991
Does anyone know why the spread
approach rather than an explicit for loop (which I thought were, generally, to be avoided at all costs) is so much slower?
Upvotes: 2
Views: 584
Reputation: 7267
The basic answer here is "it isn't". Maybe with a specific compiler and specific circumstances, the intrinsic isn't as well-optimized as an explicit DO loop, but it doesn't have to be that way. I tested with ifort 19, and even at default optimization levels, the SPREAD intrinsic and the explicit loop generated similar code, with the intrinsic being faster when I correct the program to use the result.
Iteration 0.2187500 0.1376885
Spread 9.3750000E-02 0.1376885
I would also caution (as I did in the comments to your question) that simplistic benchmark programs often don't measure what the author thinks they do. The most common error, which your original and revised examples both exhibit, is that the result of the work-under-test is never used so a sufficiently clever compiler can simply evaporate the entire operation. Indeed when I build both your test cases with ifort 19, the compiler completely removes all the work leaving only the timing code. Needless to say, that runs pretty fast.
implicit none
integer, parameter :: n=500
integer :: k
real :: d2(n,n)
real :: d3(n,n,n)
contains
! Iteration
subroutine benchmark_a(res)
real, intent(out) :: res(n,n,n)
do k = 1, size(d3,3)
res(:,:,k) = d2*d3(:,:,k)
end do
end subroutine
! Spread
subroutine benchmark_b(res)
real, intent(out) :: res(n,n,n)
res = d3*spread(d2, 3, size(d3,3))
end subroutine
end module
program main
use benchmarks
real :: tstart,tend
real :: res(n,n,n)
call random_number(d2)
call random_number(d3)
! Iteration
call cpu_time(tstart)
call benchmark_a(res)
call cpu_time(tend)
write(*,*) 'Iteration', tend-tstart, res(10,10,10)
! Spread
call cpu_time(tstart)
call benchmark_b(res)
call cpu_time(tend)
write(*,*) 'Spread', tend-tstart, res(10,10,10)
end program```
Upvotes: 5