innoSPG

Reputation: 4656

Big overhead when subroutine is in separate module versus in the same file as the main program

I am evaluating the overhead (in wall clock time) of some features in Fortran programs, and I came across behavior with GNU Fortran that I did not expect: whether the subroutine lives in the same file as the main program (in its contains section or in a module) or in a separate module in a separate file has a big impact on run time.

The simple code that reproduces the behavior has a subroutine that performs a matrix-vector multiplication 250000 times. In the first test, the subroutine is in the contains section of the main program. In the second test, the same subroutine is in a separate module. The difference in performance between the two is big.

With the subroutine in the contains section of the main program, 10 runs yield

min: 1.249
avg: 1.266
1.275 - 1.249 - 1.264 - 1.279 - 1.266 - 1.253 - 1.271 - 1.251 - 1.269 - 1.284

With the subroutine in a separate module, 10 runs yield

min: 1.810
avg: 1.861
1.848 - 1.862 - 1.853 - 1.871 - 1.854 - 1.883 - 1.810 - 1.860 - 1.886 - 1.884

About 50% slower; this factor stays roughly the same when I change the size of the matrix or the number of iterations. Those tests were done with gfortran 4.8.5. With gfortran 8.3.0, the program runs a little faster, but the time doubles when going from the subroutine in the contains section of the main program to the subroutine in a separate module.

The Portland Group compiler does not have that problem with my test program, and it runs even faster than the best case of gfortran.

If I read the size of the matrix from an input file (or a runtime command-line argument) and allocate dynamically, the difference in wall clock time goes away and both cases run at the slower speed (the wall clock time of the subroutine in the separate module, separate file). I suspect that gfortran is able to optimize the main program better when the size of the matrix is known at compile time in the main program.
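
For reference, the run-time-size variant described above could look roughly like the following. This is only a minimal sketch, not the exact code that was timed: the program name test_dynamic, the default size of 100, the usage message and the simple fill of A (every entry 1/n, which keeps the iterates bounded) are illustrative assumptions. The loop subroutine is the same as in the listing below.

```fortran
! Sketch only: the matrix size is read at run time and the arrays are
! allocated dynamically. Names and the fill of A are assumptions.
program test_dynamic
    use iso_fortran_env, only: real64
    implicit none
    integer, parameter :: N_TEST = 250000
    real(real64), allocatable :: A(:,:), x(:), y(:)
    real(real64) :: start_time, end_time
    character(len=32) :: arg
    integer :: n, ios
    !
    n = 100                          ! default size when no argument is given
    if (command_argument_count() >= 1) then
        call get_command_argument(1, arg)
        read(arg, *, iostat=ios) n
        if (ios /= 0 .or. n < 1) stop 'usage: ./dynamic [N]'
    end if
    !
    allocate(A(n,n), x(n), y(n))
    A = 1.0_real64/n                 ! simple bounded test matrix (assumption)
    x = 1.0_real64
    y = 0
    !
    call cpu_time(start_time)
    call axpy_loop_in(A, x, y, N_TEST)
    call cpu_time(end_time)
    !
    write(*, "('Total time:',f10.3)") end_time - start_time
    write(*, "('Sum = ',ES14.5E3)") sum(y)
    !
contains
    !
    ! Same loop as in the original test: y = A^nloop * x
    subroutine axpy_loop_in(A, x, y, nloop)
        real(real64), dimension(:,:), intent(in) :: A
        real(real64), dimension(:), intent(in) :: x
        real(real64), dimension(:), intent(inout) :: y
        integer, intent(in) :: nloop
        real(real64), dimension(size(x)) :: z
        integer :: k, iter
        !
        y = x
        do iter = 1, nloop
            z = y
            y = 0
            do k = 1, size(A,2)
                y = y + A(:,k)*z(k)
            end do
        end do
    end subroutine axpy_loop_in
    !
end program test_dynamic
```

Here the compiler no longer sees a constant N in the main program, which is the situation in which both placements of the subroutine ran equally slowly for me.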

What am I doing wrong that the GNU compiler does not like, or what is the GNU compiler doing poorly? Are there compiler flags to help gfortran in such cases?

Everything is compiled with optimization -O3

Code (test_simple.f90)

!< @file test_simple.f90
!! simple test
!>
!
program test_simple
    !
    use iso_fortran_env
    use test_mod
    !
implicit none
    !
    integer, parameter :: N = 100
    integer, parameter :: N_TEST = 250000
    logical, parameter :: GENERATE=.false.
    !
    real(real64), parameter :: dx = 10.0_real64
    real(real64), parameter :: lx = 40.0_real64
    !
    real(real64), dimension(N,N) :: A
    real(real64), dimension(N) :: x, y
    real(real64) :: start_time, end_time
    real(real64) :: duration
    !
    integer :: k, loop_idx
    !
    call make_matrix(A,dx,lx)
    x = A(N/2,:)
    ! 
    y = 0
    call cpu_time( start_time )
    call axpy_loop (A, x, y, N_TEST)
    !call axpy_loop_in (A, x, y, N_TEST)
    !
    call cpu_time( end_time )
    !
    duration = end_time-start_time
    !
    if( duration < 0.01 )then
        write( *, "('Total time:',f10.6)" ) duration
    else
        write( *, "('Total time:',f10.3)" ) duration 
    end if
    !
    write(*,"('Sum = ',ES14.5E3)") sum(y)
    !
contains
    !
    !< @brief compute y = A^nloop * x
    !! @param[in] A matrix to use
    !! @param[in] x vector to use
    !! @param[in, out] y output
    !! @param[in] nloop number of iterations, power to apply to A
    !! 
    !>
    subroutine axpy_loop_in (A, x, y, nloop)
        real(real64), dimension(:,:), intent(in) :: A
        real(real64), dimension(:), intent(in) :: x
        real(real64), dimension(:), intent(inout) :: y
        integer, intent(in) :: nloop
        !
        real(real64), dimension(size(x)) :: z
        integer :: k, iter
        !
        y = x
        do iter = 1, nloop
            z = y
            y = 0
            do k = 1, size(A,2)
                y = y + A(:,k)*z(k)
            end do 
        end do
        !
    end subroutine axpy_loop_in
    !
    !> @brief Computes the square exponential correlation kernel matrix for
    !! a 1D uniform grid, using coordinate vector and scalar parameters
    !! @param [in, out] C square matrix of correlation (kernel)
    !! @param [in] dx grid spacing
    !! @param [in] lx decorrelation length
    !!
    !! The correlation between the grid points i and j is given by
    !! \f$ C(i,j) = \exp(\frac{-(xi-xj)^2}{2l_xi l_xj}) \f$
    !! where xi and xj are respectively the coordinates of point i and j
    !>
    subroutine make_matrix(C, dx, lx)
        ! some definitions of the square correlation
        ! uses 2l^2 while other use l^2
        ! l^2 is used here by setting this factor to 1.
        real(real64), parameter :: factor = 1.0
        !
        real(real64), dimension(:,:), intent(in out) :: C
        real(real64), intent(in) :: dx
        real(real64), intent(in) :: lx
        ! Local variables
        real(real64), dimension(size(C,1)) :: nfacts
        real :: dist, denom
        integer :: ii, jj
        !
        do jj=1, size(C,2)
            do ii=1, size(C,1)
                dist  = (ii-jj)*dx
                denom = factor*lx*lx
                C(ii, jj) = exp( -dist*dist/denom )
            end do
            ! compute normalization factors
            nfacts(jj) = sqrt( sum( C(:, jj) ) )
        end do
        !
        ! normalize to prevent arbitrary growth in those tests
        ! where we apply the exponential of the matrix
        do jj=1, size(C,2)
            do ii=1, size(C,1)
                C(ii, jj) = C(ii, jj)/( nfacts(ii)*nfacts(jj) )
            end do
        end do
        ! remove the very small
        where( C<epsilon(1.) ) C=0.
        !
    end subroutine make_matrix
    !
end program test_simple
!

Code (test_mod.f90)

!> @file test_mod.f90
!! simple operations
!<

!< @brief module for simple operations
!!
!>
module test_mod
    use iso_fortran_env
implicit none

contains
    !
    !< @brief compute y = A^nloop * x
    !! @param[in] A matrix to use
    !! @param[in] x vector to use
    !! @param[in, out] y output
    !! @param[in] nloop number of iterations, power to apply to A
    !! 
    !>
    subroutine axpy_loop( A, x, y, nloop )
        real(real64), dimension(:,:), intent(in) :: A
        real(real64), dimension(:), intent(in) :: x
        real(real64), dimension(:), intent(inout) :: y
        integer, intent(in) :: nloop
        !
        real(real64), dimension(size(x)) :: z
        integer :: k, iter
        !
        y = x
        do iter = 1, nloop
            z = y
            y = 0
            do k = 1, size(A,2)
                y = y + A(:,k)*z(k)
            end do 
        end do
        !
    end subroutine axpy_loop
    !
end module test_mod

compile as

gfortran -O3 -o simple test_mod.f90 test_simple.f90

run as

./simple

Upvotes: 2

Views: 180

Answers (1)

innoSPG

Reputation: 4656

The combination of the flags -march=native and -flto solves the problem, at least on my test computers. With those options, the program is fully optimized and there is no difference between having the subroutine in the same file as the main program and having it in a separate file (separate module). In addition, the run time is comparable to the run time with the Portland Group compiler. Neither option alone solved the problem: -march=native alone speeds up the version with the subroutine in the contains section but makes the module version worse.

My biased thinking is that the option -march=native should be the default; users doing something else are experienced and know what they are doing, so they can add the appropriate option or disable the default, whereas the common user will not easily think of it.

Thank you for all the comments.

Upvotes: 1
