Big overhead when subroutine is in separate module versus in the same file as the main program

Question

I am evaluating the overhead cost (in wall clock time) of some features in Fortran programs, and I came across behavior with GNU Fortran that I did not expect: having the subroutine in the same file as the main program (in the contains section or in a module) versus having it in a separate module (in a separate file) has a big impact on performance.

The simple code that reproduces the behavior uses a subroutine that performs a matrix-vector multiplication 250000 times. In the first test, the subroutine is in the contains section of the main program. In the second test, the same subroutine is in a separate module. The difference in performance between the two is big.

With the subroutine in the contains section of the main program, 10 runs yield (times in seconds):

min: 1.249
avg: 1.266
1.275 - 1.249 - 1.264 - 1.279 - 1.266 - 1.253 - 1.271 - 1.251 - 1.269 - 1.284

With the subroutine in a separate module, 10 runs yield (times in seconds):

min: 1.848
avg: 1.861
1.848 - 1.862 - 1.853 - 1.871 - 1.854 - 1.883 - 1.810 - 1.860 - 1.886 - 1.884

That is about 50% slower, and the factor seems consistent across matrix sizes and numbers of iterations. These tests were done with gfortran 4.8.5. With gfortran 8.3.0, the program runs a little faster, but the time doubles when going from the subroutine in the contains section of the main program to the subroutine in a separate module.

The Portland Group compiler does not have this problem with my test program, and it runs even faster than the best case of gfortran.

If I read the size of the matrix from an input file (or a runtime command-line argument) and allocate dynamically, the difference in wall clock time goes away and both cases run at the slower speed (the wall clock time of the subroutine in the separate module and separate file). I suspect that gfortran can optimize the main program better when the size of the matrix is known at compile time in the main program.
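A minimal sketch of what such a runtime-sized setup could look like, assuming the size is passed as the first command-line argument. The program name `test_dynamic`, the fallback size of 100, and the placeholder matrix values are assumptions for illustration only, not the benchmark code itself:

```fortran
program test_dynamic
    use iso_fortran_env, only: real64
    implicit none
    ! Sizes are unknown at compile time, so the arrays are allocatable.
    real(real64), allocatable :: A(:,:), x(:), y(:)
    character(len=32) :: arg
    integer :: n, ios

    ! Hypothetical interface: matrix size as the first command-line argument,
    ! falling back to 100 when it is absent, unreadable, or not positive.
    n = 100
    if (command_argument_count() >= 1) then
        call get_command_argument(1, arg)
        read(arg, *, iostat=ios) n
        if (ios /= 0) n = 100
        if (n < 1) n = 100
    end if

    allocate(A(n,n), x(n), y(n))

    ! Placeholder data; the real test would fill A with make_matrix.
    A = 1.0_real64 / real(n, real64)
    x = 1.0_real64

    ! One matrix-vector product, just to exercise the allocated arrays.
    y = matmul(A, x)
    write(*, "('n = ',i0,'  sum(y) = ',es14.5e3)") n, sum(y)
end program test_dynamic
```

With this layout the compiler no longer sees `N` as a compile-time constant in the main program, which is the situation in which both timings converge.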

What am I doing wrong that the GNU compilers do not like, or what is the GNU compiler doing poorly? Are there compiler flags to help gfortran in such cases?

Everything is compiled with -O3 optimization.

Code (test_simple.f90)

!< @file test_simple.f90
!! simple test
!>
!
program test_simple
    !
    use iso_fortran_env
    use test_mod
    !
implicit none
    !
    integer, parameter :: N = 100
    integer, parameter :: N_TEST = 250000
    logical, parameter :: GENERATE=.false.
    !
    real(real64), parameter :: dx = 10.0_real64
    real(real64), parameter :: lx = 40.0_real64
    !
    real(real64), dimension(N,N) :: A
    real(real64), dimension(N) :: x, y
    real(real64) :: start_time, end_time
    real(real64) :: duration
    !
    integer :: k, loop_idx
    !
    call make_matrix(A,dx,lx)
    x = A(N/2,:)
    ! 
    y = 0
    call cpu_time( start_time )
    call axpy_loop (A, x, y, N_TEST)
    !call axpy_loop_in (A, x, y, N_TEST)
    !
    call cpu_time( end_time )
    !
    duration = end_time-start_time
    !
    if( duration < 0.01 )then
        write( *, "('Total time:',f10.6)" ) duration
    else
        write( *, "('Total time:',f10.3)" ) duration 
    end if
    !
    write(*,"('Sum = ',ES14.5E3)") sum(y)
    !
contains
    !
    !< @brief compute y = A^nloop * x
    !! @param[in] A matrix to use
    !! @param[in] x vector to use
    !! @param[in, out] y output vector
    !! @param[in] nloop number of iterations, power to apply to A
    !! 
    !>
    subroutine axpy_loop_in (A, x, y, nloop)
        real(real64), dimension(:,:), intent(in) :: A
        real(real64), dimension(:), intent(in) :: x
        real(real64), dimension(:), intent(inout) :: y
        integer, intent(in) :: nloop
        !
        real(real64), dimension(size(x)) :: z
        integer :: k, iter
        !
        y = x
        do iter = 1, nloop
            z = y
            y = 0
            do k = 1, size(A,2)
                y = y + A(:,k)*z(k)
            end do 
        end do
        !
    end subroutine axpy_loop_in
    !
    !> @brief Computes the square exponential correlation kernel matrix for
    !! a 1D uniform grid, using coordinate vector and scalar parameters
    !! @param [in, out] C square matrix of correlation (kernel)
    !! @param [in] dx grid spacing
    !! @param [in] lx decorrelation length
    !!
    !! The correlation between the grid points i and j is given by
    !! \f$ C(i,j) = \exp(\frac{-(xi-xj)^2}{2l_xi l_xj}) \f$
    !! where xi and xj are respectively the coordinates of point i and j
    !>
    subroutine make_matrix(C, dx, lx)
        ! some definitions of the square exponential correlation
        ! use 2l^2 while others use l^2;
        ! l^2 is used here by setting this factor to 1.
        real(real64), parameter :: factor = 1.0
        !
        real(real64), dimension(:,:), intent(in out) :: C
        real(real64), intent(in) :: dx
        real(real64) lx
        ! Local variables
        real(real64), dimension(size(x)) :: nfacts
        real :: dist, denom
        integer :: ii, jj
        !
        do jj=1, size(C,2)
            do ii=1, size(C,1)
                dist  = (ii-jj)*dx
                denom = factor*lx*lx
                C(ii, jj) = exp( -dist*dist/denom )
            end do
            ! compute normalization factors
            nfacts(jj) = sqrt( sum( C(:, jj) ) )
        end do
        !
        ! normalize to prevent arbitrary growth in those tests
        ! where we apply the exponential of the matrix
        do jj=1, size(C,2)
            do ii=1, size(C,1)
                C(ii, jj) = C(ii, jj)/( nfacts(ii)*nfacts(jj) )
            end do
        end do
        ! remove the very small
        where( C



Code (test_mod.f90)

!> @file test_mod.f90
!! simple operations
!<

!< @brief module for simple operations
!!
!>
module test_mod
    use iso_fortran_env
implicit none

contains
    !
    !< @brief compute y = A^nloop * x
    !! @param[in] A matrix to use
    !! @param[in] x vector to use
    !! @param[in, out] y output vector
    !! @param[in] nloop number of iterations, power to apply to A
    !! 
    !>
    subroutine axpy_loop( A, x, y, nloop )
        real(real64), dimension(:,:), intent(in) :: A
        real(real64), dimension(:), intent(in) :: x
        real(real64), dimension(:), intent(inout) :: y
        integer, intent(in) :: nloop
        !
        real(real64), dimension(size(x)) :: z
        integer :: k, iter
        !
        y = x
        do iter = 1, nloop
            z = y
            y = 0
            do k = 1, size(A,2)
                y = y + A(:,k)*z(k)
            end do 
        end do
        !
    end subroutine axpy_loop
    !
end module test_mod


compile as

gfortran -O3 -o simple test_mod.f90 test_simple.f90


run as

./simple
