AboAmmar
AboAmmar

Reputation: 5559

Plain vs. allocatable/pointer arrays, Fortran advice?

I wrote the following contrived example for matrix multiplication just to examine how declaring different types of arrays can affect the performance. To my surprise, I found that the performance of plain arrays with known sizes at declaration is inferior to both allocatable/pointer arrays. I thought allocatable was only needed for large arrays that don't fit into the stack. Here is the code and timings using both gfortran and Intel Fortran compilers. Windows 10 platform is used with compiler flags -Ofast and -fast, respectively.

program matrix_multiply
   implicit none
   integer, parameter :: n = 1500
   real(8) :: a(n,n), b(n,n), c(n,n), aT(n,n)                 ! plain arrays 
   integer :: i, j, k, ts, te, count_rate, count_max
   real(8) :: tmp

   ! real(8), allocatable :: A(:,:), B(:,:), C(:,:), aT(:,:)  ! allocatable arrays
   ! allocate ( a(n,n), b(n,n), c(n,n), aT(n,n) )

   do i = 1,n
      do j = 1,n
         a(i,j) = 1.d0/n/n * (i-j) * (i+j)
         b(i,j) = 1.d0/n/n * (i-j) * (i+j)
      end do 
   end do 

   ! transpose for cache-friendliness   
   do i = 1,n
      do j = 1,n
         aT(j,i) = a(i,j)
      end do 
   end do 

   call system_clock(ts, count_rate, count_max)
   do i = 1,n
      do j = 1,n
         tmp = 0 
         do k = 1,n
            tmp = tmp + aT(k,i) * b(k,j)
         end do
         c(i,j) = tmp
      end do
   end do
   call system_clock(te)
   print '(4G0)', "Elapsed time: ", real(te-ts)/count_rate,', c_(n/2+1) = ', c(n/2+1,n/2+1)    
end program matrix_multiply

The timings are as follows:

! Intel Fortran
! -------------
Elapsed time: 1.546000, c_(n/2+1) = -143.8334 ! Plain Arrays
Elapsed time: 1.417000, c_(n/2+1) = -143.8334 ! Allocatable Arrays  

! gfortran:
! -------------
Elapsed time: 1.827999, c_(n/2+1) = -143.8334 ! Plain Arrays 
Elapsed time: 1.702999, c_(n/2+1) = -143.8334 ! Allocatable Arrays

My question is why this happens? Do allocatable arrays give the compiler more guarantees to optimize better? What is the best advice in general when dealing with fixed size arrays in Fortran?

At the risk of lengthening the question, here is another example where Intel Fortran compiler exhibits the same behavior:

program testArrays
  implicit none
  integer, parameter :: m = 1223, n = 2015 
  real(8), parameter :: pi = acos(-1.d0)
  real(8) :: a(m,n)
  real(8), allocatable :: b(:,:)
  real(8), pointer :: c(:,:)
  integer :: i, sz = min(m, n), t0, t1, count_rate, count_max

  allocate( b(m,n), c(m,n) )
  call random_seed()
  call random_number(a)
  call random_number(b)
  call random_number(c)

  call system_clock(t0, count_rate, count_max)
    do i=1,1000
      call doit(a,sz)
    end do 
  call system_clock(t1)
  print '(4g0)', 'Time plain: ', real(t1-t0)/count_rate, ',  sum 3x3 = ', sum( a(1:3,1:3) )

  call system_clock(t0)
    do i=1,1000
      call doit(b,sz)
    end do 
  call system_clock(t1)
  print '(4g0)', 'Time alloc: ', real(t1-t0)/count_rate, ',  sum 3x3 = ', sum( b(1:3,1:3) )

  call system_clock(t0)
    do i=1,1000 
      call doitp(c,sz)
    end do 
  call system_clock(t1)
  print '(4g0)', 'Time p.ptr: ', real(t1-t0)/count_rate, ',  sum 3x3 = ', sum( c(1:3,1:3) )

  contains 
  subroutine doit(a,sz)
    real(8) :: a(:,:)
    integer :: sz 
    a(1:sz,1:sz) = sin(2*pi*a(1:sz,1:sz))/(a(1:sz,1:sz)+1)
  end

  subroutine doitp(a,sz)
    real(8), pointer :: a(:,:)
    integer :: sz
    a(1:sz,1:sz) = sin(2*pi*a(1:sz,1:sz))/(a(1:sz,1:sz)+1)
  end    
end program testArrays 

ifort timings:

Time plain: 2.857000,  sum 3x3 = -.9913536
Time alloc: 2.750000,  sum 3x3 = .4471794
Time p.ptr: 2.786000,  sum 3x3 = 2.036269  

gfortran timings, however, are much longer but follow my expectation:

Time plain: 51.5600014,  sum 3x3 = 6.2749456118192093
Time alloc: 54.0300007,  sum 3x3 = 6.4144775892064283
Time p.ptr: 54.1900034,  sum 3x3 = -2.1546109819149963

Upvotes: 2

Views: 669

Answers (2)

Scientist
Scientist

Reputation: 1845

This is not an answer to why you get what you observe, but rather a report of disagreement with your observations. Your code,

program matrix_multiply
   implicit none
   integer, parameter :: n = 1500
  !real(8) :: a(n,n), b(n,n), c(n,n), aT(n,n)                 ! plain arrays 
   integer :: i, j, k, ts, te, count_rate, count_max
   real(8) :: tmp

   real(8), allocatable :: A(:,:), B(:,:), C(:,:), aT(:,:)  ! allocatable arrays
   allocate ( a(n,n), b(n,n), c(n,n), aT(n,n) )

   do i = 1,n
      do j = 1,n
         a(i,j) = 1.d0/n/n * (i-j) * (i+j)
         b(i,j) = 1.d0/n/n * (i-j) * (i+j)
      end do 
   end do 

   ! transpose for cache-friendliness   
   do i = 1,n
      do j = 1,n
         aT(j,i) = a(i,j)
      end do 
   end do 

   call system_clock(ts, count_rate, count_max)
   do i = 1,n
      do j = 1,n
         tmp = 0 
         do k = 1,n
            tmp = tmp + aT(k,i) * b(k,j)
         end do
         c(i,j) = tmp
      end do
   end do
   call system_clock(te)
   print '(4G0)', "Elapsed time: ", real(te-ts)/count_rate,', c_(n/2+1) = ', c(n/2+1,n/2+1)    
end program matrix_multiply

compiled with Intel Fortran compiler 18.0.2 on Windows and optimization flags turned on,

ifort /standard-semantics /F0x1000000000 /O3 /Qip /Qipo /Qunroll /Qunroll-aggressive /inline:all /Ob2 main.f90 -o run.exe

gives, in fact, the opposite of what you observe:

Elapsed time: 1.580000, c_(n/2+1) = -143.8334   ! plain arrays
Elapsed time: 1.560000, c_(n/2+1) = -143.8334   ! plain arrays
Elapsed time: 1.555000, c_(n/2+1) = -143.8334   ! plain arrays
Elapsed time: 1.588000, c_(n/2+1) = -143.8334   ! plain arrays
Elapsed time: 1.551000, c_(n/2+1) = -143.8334   ! plain arrays
Elapsed time: 1.566000, c_(n/2+1) = -143.8334   ! plain arrays
Elapsed time: 1.555000, c_(n/2+1) = -143.8334   ! plain arrays

Elapsed time: 1.634000, c_(n/2+1) = -143.8334   ! allocatable arrays
Elapsed time: 1.634000, c_(n/2+1) = -143.8334   ! allocatable arrays
Elapsed time: 1.602000, c_(n/2+1) = -143.8334   ! allocatable arrays
Elapsed time: 1.623000, c_(n/2+1) = -143.8334   ! allocatable arrays
Elapsed time: 1.597000, c_(n/2+1) = -143.8334   ! allocatable arrays
Elapsed time: 1.607000, c_(n/2+1) = -143.8334   ! allocatable arrays
Elapsed time: 1.617000, c_(n/2+1) = -143.8334   ! allocatable arrays
Elapsed time: 1.606000, c_(n/2+1) = -143.8334   ! allocatable arrays
Elapsed time: 1.626000, c_(n/2+1) = -143.8334   ! allocatable arrays
Elapsed time: 1.614000, c_(n/2+1) = -143.8334   ! allocatable arrays

As you can see, the allocatable arrays are in fact slightly slower, on average, which is what I expected to see, which also contradicts your observations. The only source of difference that I can see is the optimization flags used, though I am not sure how that could make a difference. Perhaps you'd want to run your tests in multiple different modes of no optimization and with different levels of optimization, and see if you get consistent performance differences in all modes or not. To get more info about the meaning of the optimization flags used, see Intel's reference page.

Also, do not use real(8) for variable declarations. It is a non-standard syntax, non-portable, and therefore, potentially problematic. A more consistent way, according to the Fortran standard is to use iso_fortran_env intrinsic module, like:

!...
use, intrinsic :: iso_fortran_env, only: real64, int32
integer(int32), parameter :: n=100
real(real64) :: a(n)
!...

This intrinsic module has the following kinds,

   int8 ! 8-bit integer
  int16 ! 16-bit integer
  int32 ! 32-bit integer
  int64 ! 64-bit integer
 real32 ! 32-bit real
 real64 ! 64-bit real
real128 ! 128-bit real 

So, for example, if you wanted to declare a complex variable with components of 64-bit kind, you could write:

program complex
    use, intrinsic :: iso_fortran_env, only: RK => real64, output_unit
    ! the intrinsic attribute above is not essential, but recommended, so this would be also valid:
    ! use iso_fortran_env, only: RK => real64, output_unit
    complex(RK) :: z = (1._RK, 2._RK)
    write(output_unit,"(*(g0,:,' '))") "Hello World! This is a complex variable:", z
end program complex

which gives:

$gfortran -std=f2008 *.f95 -o main
$main
Hello World! This is a complex variable: 1.0000000000000000 2.0000000000000000

Note that this requires Fortran 2008 compliant compiler. There are also other functions and entities in iso_fortran_env, like output_unit which is the unit number for the preconnected standard output unit (the same one that is used by print or write with a unit specifier of *), as well as several others like compiler_version(), compiler_options(), and more.

Upvotes: 1

IanH
IanH

Reputation: 21461

To get an idea whether the compiler thinks there is a difference, look at the generated assembly for the procedures. Based on a quick look here, the assembly for the timed section of the two cases for the first example appears to be more or less equivalent, in terms of the work that the processor has to do. This is as expected, because the arrays presented to the timed section are more or less equivalent - they are large, contiguous, not overlapping and with element values only known at runtime.

(Beyond the compiler, there can then be differences due to the way data presents in the various caches at runtime, but that should be similar for both cases as well.)

The main difference between explicit shape and allocatable arrays is in the time that it takes to allocate and deallocate the storage for the latter. There are only four allocations at most in your first example (so it is not likely to onerous relative to subsequent calculations), and you don't time that part of the program. Stick the allocation/implicit deallocation pair inside a loop, then see how you go.

Arrays with the pointer or target attribute may be subject to aliasing, so the compiler may have to do extra work to account for the possibility of storage for the arrays overlapping. However the nature of the expression in the second example (only the one array is referenced) is such that the compiler likely knows that there is no need for the extra work in this particular case, and the operations become equivalent again.

In response to "I thought allocatable was only needed for large arrays that don't fit into the stack" - allocatable is needed (i.e. you have no real choice) when you cannot determine the size or other characteristics of the thing being allocated in the specification part of the procedure responsible for the entirety of the existence of the thing. Even for things not known until runtime, if you can still determine the characteristics in the specification part of the relevant procedure, then automatic variables are an option. (There are no automatic variables in your example though - in the non-allocatable, non-pointer cases, all the characteristics of the arrays are known at compile time.) At a Fortran processor implementation level, which varies between compilers and compile options, automatic variables may require more stack space than is available, and this can cause problems that allocatables may alleviate (or you can just change compiler options).

Upvotes: 1

Related Questions