Reputation: 5559
I wrote the following contrived example for matrix multiplication just to examine how declaring different types of arrays can affect the performance. To my surprise, I found that the performance of plain arrays with sizes known at declaration is inferior to both allocatable and pointer arrays. I thought allocatable was only needed for large arrays that don't fit into the stack. Here is the code and the timings, using both the gfortran and Intel Fortran compilers on a Windows 10 platform with the compiler flags -Ofast and -fast, respectively.
program matrix_multiply
    implicit none
    integer, parameter :: n = 1500
    real(8) :: a(n,n), b(n,n), c(n,n), aT(n,n) ! plain arrays
    integer :: i, j, k, ts, te, count_rate, count_max
    real(8) :: tmp
    ! real(8), allocatable :: A(:,:), B(:,:), C(:,:), aT(:,:) ! allocatable arrays
    ! allocate ( a(n,n), b(n,n), c(n,n), aT(n,n) )
    do i = 1,n
        do j = 1,n
            a(i,j) = 1.d0/n/n * (i-j) * (i+j)
            b(i,j) = 1.d0/n/n * (i-j) * (i+j)
        end do
    end do
    ! transpose for cache-friendliness
    do i = 1,n
        do j = 1,n
            aT(j,i) = a(i,j)
        end do
    end do
    call system_clock(ts, count_rate, count_max)
    do i = 1,n
        do j = 1,n
            tmp = 0
            do k = 1,n
                tmp = tmp + aT(k,i) * b(k,j)
            end do
            c(i,j) = tmp
        end do
    end do
    call system_clock(te)
    print '(4G0)', "Elapsed time: ", real(te-ts)/count_rate, ', c_(n/2+1) = ', c(n/2+1,n/2+1)
end program matrix_multiply
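For reference, the executables were built with commands along these lines (file and executable names here are illustrative; ifort on Windows may also accept the /fast spelling):
gfortran -Ofast matrix_multiply.f90 -o matmul_gfortran.exe
ifort -fast matrix_multiply.f90 -o matmul_ifort.exe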
The timings are as follows:
! Intel Fortran
! -------------
Elapsed time: 1.546000, c_(n/2+1) = -143.8334 ! Plain Arrays
Elapsed time: 1.417000, c_(n/2+1) = -143.8334 ! Allocatable Arrays
! gfortran:
! -------------
Elapsed time: 1.827999, c_(n/2+1) = -143.8334 ! Plain Arrays
Elapsed time: 1.702999, c_(n/2+1) = -143.8334 ! Allocatable Arrays
My question is: why does this happen? Do allocatable arrays give the compiler more guarantees that let it optimize better? What is the best general advice when dealing with fixed-size arrays in Fortran?
At the risk of lengthening the question, here is another example where the Intel Fortran compiler exhibits the same behavior:
program testArrays
    implicit none
    integer, parameter :: m = 1223, n = 2015
    real(8), parameter :: pi = acos(-1.d0)
    real(8) :: a(m,n)
    real(8), allocatable :: b(:,:)
    real(8), pointer :: c(:,:)
    integer :: i, sz = min(m, n), t0, t1, count_rate, count_max
    allocate( b(m,n), c(m,n) )
    call random_seed()
    call random_number(a)
    call random_number(b)
    call random_number(c)
    call system_clock(t0, count_rate, count_max)
    do i = 1, 1000
        call doit(a,sz)
    end do
    call system_clock(t1)
    print '(4g0)', 'Time plain: ', real(t1-t0)/count_rate, ', sum 3x3 = ', sum( a(1:3,1:3) )
    call system_clock(t0)
    do i = 1, 1000
        call doit(b,sz)
    end do
    call system_clock(t1)
    print '(4g0)', 'Time alloc: ', real(t1-t0)/count_rate, ', sum 3x3 = ', sum( b(1:3,1:3) )
    call system_clock(t0)
    do i = 1, 1000
        call doitp(c,sz)
    end do
    call system_clock(t1)
    print '(4g0)', 'Time p.ptr: ', real(t1-t0)/count_rate, ', sum 3x3 = ', sum( c(1:3,1:3) )
contains
    subroutine doit(a,sz)
        real(8) :: a(:,:)
        integer :: sz
        a(1:sz,1:sz) = sin(2*pi*a(1:sz,1:sz))/(a(1:sz,1:sz)+1)
    end subroutine doit
    subroutine doitp(a,sz)
        real(8), pointer :: a(:,:)
        integer :: sz
        a(1:sz,1:sz) = sin(2*pi*a(1:sz,1:sz))/(a(1:sz,1:sz)+1)
    end subroutine doitp
end program testArrays
ifort timings:
Time plain: 2.857000, sum 3x3 = -.9913536
Time alloc: 2.750000, sum 3x3 = .4471794
Time p.ptr: 2.786000, sum 3x3 = 2.036269
gfortran timings, however, are much longer but follow my expectation:
Time plain: 51.5600014, sum 3x3 = 6.2749456118192093
Time alloc: 54.0300007, sum 3x3 = 6.4144775892064283
Time p.ptr: 54.1900034, sum 3x3 = -2.1546109819149963
Upvotes: 2
Views: 669
Reputation: 1845
This is not an answer to why you get what you observe, but rather a report of disagreement with your observations. Your code,
program matrix_multiply
    implicit none
    integer, parameter :: n = 1500
    !real(8) :: a(n,n), b(n,n), c(n,n), aT(n,n) ! plain arrays
    integer :: i, j, k, ts, te, count_rate, count_max
    real(8) :: tmp
    real(8), allocatable :: A(:,:), B(:,:), C(:,:), aT(:,:) ! allocatable arrays
    allocate ( a(n,n), b(n,n), c(n,n), aT(n,n) )
    do i = 1,n
        do j = 1,n
            a(i,j) = 1.d0/n/n * (i-j) * (i+j)
            b(i,j) = 1.d0/n/n * (i-j) * (i+j)
        end do
    end do
    ! transpose for cache-friendliness
    do i = 1,n
        do j = 1,n
            aT(j,i) = a(i,j)
        end do
    end do
    call system_clock(ts, count_rate, count_max)
    do i = 1,n
        do j = 1,n
            tmp = 0
            do k = 1,n
                tmp = tmp + aT(k,i) * b(k,j)
            end do
            c(i,j) = tmp
        end do
    end do
    call system_clock(te)
    print '(4G0)', "Elapsed time: ", real(te-ts)/count_rate, ', c_(n/2+1) = ', c(n/2+1,n/2+1)
end program matrix_multiply
compiled with the Intel Fortran compiler 18.0.2 on Windows, with optimization flags turned on,
ifort /standard-semantics /F0x1000000000 /O3 /Qip /Qipo /Qunroll /Qunroll-aggressive /inline:all /Ob2 main.f90 -o run.exe
gives, in fact, the opposite of what you observe:
Elapsed time: 1.580000, c_(n/2+1) = -143.8334 ! plain arrays
Elapsed time: 1.560000, c_(n/2+1) = -143.8334 ! plain arrays
Elapsed time: 1.555000, c_(n/2+1) = -143.8334 ! plain arrays
Elapsed time: 1.588000, c_(n/2+1) = -143.8334 ! plain arrays
Elapsed time: 1.551000, c_(n/2+1) = -143.8334 ! plain arrays
Elapsed time: 1.566000, c_(n/2+1) = -143.8334 ! plain arrays
Elapsed time: 1.555000, c_(n/2+1) = -143.8334 ! plain arrays
Elapsed time: 1.634000, c_(n/2+1) = -143.8334 ! allocatable arrays
Elapsed time: 1.634000, c_(n/2+1) = -143.8334 ! allocatable arrays
Elapsed time: 1.602000, c_(n/2+1) = -143.8334 ! allocatable arrays
Elapsed time: 1.623000, c_(n/2+1) = -143.8334 ! allocatable arrays
Elapsed time: 1.597000, c_(n/2+1) = -143.8334 ! allocatable arrays
Elapsed time: 1.607000, c_(n/2+1) = -143.8334 ! allocatable arrays
Elapsed time: 1.617000, c_(n/2+1) = -143.8334 ! allocatable arrays
Elapsed time: 1.606000, c_(n/2+1) = -143.8334 ! allocatable arrays
Elapsed time: 1.626000, c_(n/2+1) = -143.8334 ! allocatable arrays
Elapsed time: 1.614000, c_(n/2+1) = -143.8334 ! allocatable arrays
As you can see, the allocatable arrays are in fact slightly slower on average, which is what I expected to see and which contradicts your observations. The only source of difference I can see is the set of optimization flags used, though I am not sure how that could make a difference. Perhaps you'd want to run your tests in several modes, with no optimization and with different levels of optimization, and see whether you get consistent performance differences in all of them. To get more info about the meaning of the optimization flags used, see Intel's reference page.
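For example, something along these lines (file names illustrative; /Od disables optimization):
rem no optimization
ifort /standard-semantics /Od main.f90 -o run.exe
rem increasing optimization levels
ifort /standard-semantics /O1 main.f90 -o run.exe
ifort /standard-semantics /O2 main.f90 -o run.exe
ifort /standard-semantics /O3 main.f90 -o run.exe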
Also, do not use real(8) in variable declarations. Kind values are processor-dependent: the standard does not guarantee that kind 8 exists or that it denotes a 64-bit real on every compiler, so this is non-portable and potentially problematic. A more portable way, consistent with the Fortran standard, is to use the iso_fortran_env intrinsic module, like:
!...
use, intrinsic :: iso_fortran_env, only: real64, int32
integer(int32), parameter :: n=100
real(real64) :: a(n)
!...
This intrinsic module has the following kinds:
int8 ! 8-bit integer
int16 ! 16-bit integer
int32 ! 32-bit integer
int64 ! 64-bit integer
real32 ! 32-bit real
real64 ! 64-bit real
real128 ! 128-bit real
So, for example, if you wanted to declare a complex variable with components of 64-bit kind, you could write:
program complex
    use, intrinsic :: iso_fortran_env, only: RK => real64, output_unit
    ! the intrinsic attribute above is not essential, but recommended; this would also be valid:
    ! use iso_fortran_env, only: RK => real64, output_unit
    complex(RK) :: z = (1._RK, 2._RK)
    write(output_unit,"(*(g0,:,' '))") "Hello World! This is a complex variable:", z
end program complex
which gives:
$gfortran -std=f2008 *.f95 -o main
$main
Hello World! This is a complex variable: 1.0000000000000000 2.0000000000000000
Note that this requires a Fortran 2008 compliant compiler. There are also other functions and entities in iso_fortran_env, like output_unit, which is the unit number for the preconnected standard output unit (the same one that is used by print, or by write with a unit specifier of *), as well as several others like compiler_version(), compiler_options(), and more.
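For instance, here is a minimal program (the name is mine) that reports how it was built:
program show_compiler
    use, intrinsic :: iso_fortran_env, only: compiler_version, compiler_options
    implicit none
    print '(A)', compiler_version()   ! e.g. which compiler and version produced this executable
    print '(A)', compiler_options()   ! the options used to compile it
end program show_compiler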
Upvotes: 1
Reputation: 21461
To get an idea whether the compiler thinks there is a difference, look at the generated assembly for the procedures. Based on a quick look here, the assembly for the timed section of the two cases for the first example appears to be more or less equivalent, in terms of the work that the processor has to do. This is as expected, because the arrays presented to the timed section are more or less equivalent - they are large, contiguous, not overlapping and with element values only known at runtime.
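To take that quick look yourself, gfortran can dump the generated assembly to a .s file, e.g. (file name illustrative):
gfortran -Ofast -S matrix_multiply.f90
ifort on Windows has a comparable /Fa option for producing an assembly listing.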
(Beyond the compiler, there can then be differences due to the way data presents in the various caches at runtime, but that should be similar for both cases as well.)
The main difference between explicit-shape and allocatable arrays is in the time that it takes to allocate and deallocate the storage for the latter. There are at most four allocations in your first example (so that cost is not likely to be onerous relative to the subsequent calculations), and you don't time that part of the program. Stick the allocation/implicit deallocation pair inside a loop, as sketched below, then see how you go.
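A minimal sketch of that experiment; the loop count and array size are arbitrary choices of mine:
program time_alloc
    implicit none
    integer, parameter :: n = 1500, reps = 100
    integer :: i, ts, te, count_rate
    real(8), allocatable :: a(:,:)
    call system_clock(ts, count_rate)
    do i = 1, reps
        allocate ( a(n,n) )
        a = 0.d0          ! touch the storage so the allocation is not optimized away
        deallocate ( a )  ! explicit here; an unsaved local allocatable is deallocated implicitly
    end do
    call system_clock(te)
    print '(2G0)', "Allocation loop time: ", real(te-ts)/count_rate
end program time_alloc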
Arrays with the pointer or target attribute may be subject to aliasing, so the compiler may have to do extra work to account for the possibility of the arrays' storage overlapping. However, the nature of the expression in the second example (only the one array is referenced) is such that the compiler likely knows there is no need for the extra work in this particular case, and the operations become equivalent again.
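As a hypothetical illustration (the names are mine), here is a case where two pointer arguments may refer to overlapping storage, forcing the compiler to be conservative:
subroutine scale_into(dst, src, sz)
    real(8), pointer :: dst(:,:), src(:,:)  ! the two pointers may alias each other
    integer :: sz
    ! because dst and src may overlap, the compiler cannot assume the
    ! right-hand side stays unchanged while the left-hand side is written,
    ! and may have to evaluate through a temporary
    dst(1:sz,1:sz) = 2 * src(1:sz,1:sz)
end subroutine scale_into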
In response to "I thought allocatable was only needed for large arrays that don't fit into the stack": allocatable is needed (i.e. you have no real choice) when you cannot determine the size or other characteristics of the thing being allocated in the specification part of the procedure responsible for the entirety of that thing's existence. Even for things not known until runtime, if you can still determine the characteristics in the specification part of the relevant procedure, then automatic variables are an option. (There are no automatic variables in your example though - in the non-allocatable, non-pointer cases, all the characteristics of the arrays are known at compile time.) At the Fortran processor implementation level, which varies between compilers and compile options, automatic variables may require more stack space than is available; this can cause problems that allocatables may alleviate (or you can just change compiler options).
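For example, an automatic array takes its size from a dummy argument on entry, with no allocatable needed (a minimal sketch, names mine):
subroutine work(n)
    integer, intent(in) :: n
    real(8) :: tmp(n,n)  ! automatic array: extent fixed on entry, typically placed on the stack
    tmp = 0.d0
    ! ... use tmp; its storage disappears on return ...
end subroutine work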
Upvotes: 1