Reputation: 4656
I am evaluating the overhead cost (in wall clock time) of some features in fortran programs. And I came across the following behavior with GNU fortran, that I did not expect: having the subroutine in the same file as the main program (in the contain region or in a module) versus having the subroutine in a separate module (in separate file) has a big impact.
The simple code that reproduces the behavior is: I have a subroutine that does a matrix-vector multiplication 250000 times. In the first test, I have a subroutine in the contain region of the main program. In the second test, the same subroutine is in a separate module. The difference in performance between the two is big.
The subroutine in the contain region of the main program, 10 runs yields
min: 1.249
avg: 1.266
1.275 - 1.249 - 1.264 - 1.279 - 1.266 - 1.253 - 1.271 - 1.251 - 1.269 - 1.284
The subroutine in separate module, 10 runs yields
min: 1.848
avg: 1.861
1.848 - 1.862 - 1.853 - 1.871 - 1.854 - 1.883 - 1.810 - 1.860 - 1.886 - 1.884
About 50% slower, this factor seems consistent with the size of the matrix as well as the number of iterations. those tests are done with gfortran 4.8.5. With gfortran 8.3.0, the program runs a little faster, but the time doubles from the subroutine in the contain section of the main program to the subroutine in a separate module.
Portland group does not have that problem with my test program and it run even faster than the best case of gfortran.
If I read the size of the matrix from an input file (or runtime command line arg) and do dynamic allocation, then the difference in wall clock time goes away and both cases run slower (wall clock time of the subroutine in the separate module, separate file). I suspect that gfortran is able to optimize the main program better if the size of the matrix is known at compile time in the main program.
What am I doing wrong that GNU Compilers do not like, or what is GNU compiler doing poorly? Are there compiling flags to to help gfortran in such cases?
Everything is compiled with optimization -O3
Code (test_simple.f90)
!< @file test_simple.f90
!! simple test
!>
!
program test_simple
!
use iso_fortran_env
use test_mod
!
implicit none
!
integer, parameter :: N = 100
integer, parameter :: N_TEST = 250000
logical, parameter :: GENERATE=.false.
!
real(real64), parameter :: dx = 10.0_real64
real(real64), parameter :: lx = 40.0_real64
!
real(real64), dimension(N,N) :: A
real(real64), dimension(N) :: x, y
real(real64) :: start_time, end_time
real(real64) :: duration
!
integer :: k, loop_idx
!
call make_matrix(A,dx,lx)
x = A(N/2,:)
!
y = 0
call cpu_time( start_time )
call axpy_loop (A, x, y, N_TEST)
!call axpy_loop_in (A, x, y, N_TEST)
!
call cpu_time( end_time )
!
duration = end_time-start_time
!
if( duration < 0.01 )then
write( *, "('Total time:',f10.6)" ) duration
else
write( *, "('Total time:',f10.3)" ) duration
end if
!
write(*,"('Sum = ',ES14.5E3)") sum(y)
!
contains
!
!< @brief compute y = y + A^nx
!! @param[in] A matrix to use
!! @param[in] x vector to used
!! @param[in, out] y output
!! @param[in] nloop number of iterations, power to apply to A
!!
!>
subroutine axpy_loop_in (A, x, y, nloop)
real(real64), dimension(:,:), intent(in) :: A
real(real64), dimension(:), intent(in) :: x
real(real64), dimension(:), intent(inout) :: y
integer, intent(in) :: nloop
!
real(real64), dimension(size(x)) :: z
integer :: k, iter
!
y = x
do iter = 1, nloop
z = y
y = 0
do k = 1, size(A,2)
y = y + A(:,k)*z(k)
end do
end do
!
end subroutine axpy_loop_in
!
!> @brief Computes the square exponential correlation kernel matrix for
!! a 1D uniform grid, using coordinate vector and scalar parameters
!! @param [in, out] C square matrix of correlation (kernel)
!! @param [in] dx grid spacing
!! @param [in] lx decorrelation length
!!
!! The correlation betwen the grid points i and j is given by
!! \f$ C(i,j) = \exp(\frac{-(xi-xj)^2}{2l_xi l_xj}) \f$
!! where xi and xj are respectively the coordinates of point i and j
!>
subroutine make_matrix(C, dx, lx)
! some definitions of the square correlation
! uses 2l^2 while other use l^2
! l^2 is used here by setting this factor to 1.
real(real64), parameter :: factor = 1.0
!
real(real64), dimension(:,:), intent(in out) :: C
real(real64), intent(in) :: dx
real(real64) lx
! Local variables
real(real64), dimension(size(x)) :: nfacts
real :: dist, denom
integer :: ii, jj
!
do jj=1, size(C,2)
do ii=1, size(C,1)
dist = (ii-jj)*dx
denom = factor*lx*lx
C(ii, jj) = exp( -dist*dist/denom )
end do
! compute normalization factors
nfacts(jj) = sqrt( sum( C(:, jj) ) )
end do
!
! normalize to prevent arbitrary growth in those tests
! where we apply the exponential of the matrix
do jj=1, size(C,2)
do ii=1, size(C,1)
C(ii, jj) = C(ii, jj)/( nfacts(ii)*nfacts(jj) )
end do
end do
! remove the very small
where( C<epsilon(1.) ) C=0.
!
end subroutine make_matrix
!
end program test_simple
!
Code (test_mod.f90)
!> @file test_mod.f90
!! simple operations
!<
!< @brief module for simple operations
!!
!>
module test_mod
use iso_fortran_env
implicit none
contains
!
!< @brief compute y = y + A^nx
!! @param[in] A matrix to use
!! @param[in] x vector to used
!! @param[in, out] y output
!! @param[in] nloop number of iterations, power to apply to A
!!
!>
subroutine axpy_loop( A, x, y, nloop )
real(real64), dimension(:,:), intent(in) :: A
real(real64), dimension(:), intent(in) :: x
real(real64), dimension(:), intent(inout) :: y
integer, intent(in) :: nloop
!
real(real64), dimension(size(x)) :: z
integer :: k, iter
!
y = x
do iter = 1, nloop
z = y
y = 0
do k = 1, size(A,2)
y = y + A(:,k)*z(k)
end do
end do
!
end subroutine axpy_loop
!
end module test_mod
compile as
gfortran -O3 -o simple test_mod.f90 test_simple.f90
run as
./simple
Upvotes: 2
Views: 180
Reputation: 4656
The combination of flags -march=native
and -flto
is the solution to the problem, at least on my testing computers. With those options, the program is fully optimized and there is no difference between having the subroutine in the same file as the main program, or in a separate file (separate module). In addition, the runtime is comparable to the runtime with Portland Group compiler. Any one of these options alone did not solved the problem. -march=native
alone speeds the in contain version but makes the module version worse.
My biased thinking is that the option -march=native
should be default; users doing something else are experienced and know what they are doing so they can add the appropriate option or disable the default, whereas the common user will not easily think of it.
Thank you for all the comments.
Upvotes: 1