Reputation: 101
I'm working on optimizing array access patterns in Fortran.
I allocated and accessed the array with the understanding that Fortran stores multi-dimensional arrays in column-major order.
However, I'm observing that accessing the array in a way that should exploit its column-major storage is actually slower than accessing it in a row-major sense. This is counter-intuitive, and I'm trying to understand why it happens.
Here's the code I'm using to reproduce my situation:
program test
implicit none
integer :: data_len, sub_len
double precision, allocatable :: array_1d(:), array_2d(:,:)
double precision, allocatable :: data1(:), data2(:), data3(:), data4(:), data5(:), temp(:)
integer :: i, j, idx
integer(kind=8) :: tic1, toc1, tic2, toc2, rate
call system_clock(count_rate=rate)
data_len = 1000000
sub_len = 1000
allocate(data1(data_len))
allocate(data2(data_len))
allocate(data3(data_len))
allocate(data4(data_len))
allocate(data5(data_len))
allocate(temp(data_len))
call random_number(data1)
call random_number(data2)
call random_number(data3)
call random_number(data4)
call random_number(data5)
! Case 1: 2D Array accessed in row-major sense
allocate(array_2d(data_len,sub_len))
call system_clock(tic1)
do i=1,data_len
   array_2d(i,1) = data1(i)
   array_2d(i,2) = data2(i)
   array_2d(i,3) = data3(i)
   array_2d(i,4) = data4(i)
   array_2d(i,5) = data5(i)
enddo
call system_clock(toc1)
call system_clock(tic2)
do i=1,data_len
   temp(i) = 0.d0
   do j=1,5
      temp(i) = temp(i) + array_2d(i,j)
   enddo
enddo
call system_clock(toc2)
write(*,'(A)') '--------------------------------------------------------------------------------'
write(*,'(A)') '# 2D Array (row-major sense)'
write(*,'(A,ES23.15)') 'Compiler optimization detour - ', sum(temp)
write(*,'(A,ES23.15,A)') 'Data fetching : ', dble(toc1-tic1)/dble(rate), ' sec'
write(*,'(A,ES23.15,A)') 'Summation : ', dble(toc2-tic2)/dble(rate), ' sec'
write(*,'(A,ES23.15,A)') 'Total : ', dble(toc2-tic1)/dble(rate), ' sec'
deallocate(array_2d)
! Case 2: 2D Array accessed in column-major sense
allocate(array_2d(sub_len,data_len))
call system_clock(tic1)
do i=1,data_len
   array_2d(1,i) = data1(i)
   array_2d(2,i) = data2(i)
   array_2d(3,i) = data3(i)
   array_2d(4,i) = data4(i)
   array_2d(5,i) = data5(i)
enddo
call system_clock(toc1)
call system_clock(tic2)
do i=1,data_len
   temp(i) = 0.d0
   do j=1,5
      temp(i) = temp(i) + array_2d(j,i)
   enddo
enddo
call system_clock(toc2)
write(*,'(A)') '--------------------------------------------------------------------------------'
write(*,'(A)') '# 2D Array (column-major sense)'
write(*,'(A,ES23.15)') 'Compiler optimization detour - ', sum(temp)
write(*,'(A,ES23.15,A)') 'Data fetching : ', dble(toc1-tic1)/dble(rate), ' sec'
write(*,'(A,ES23.15,A)') 'Summation : ', dble(toc2-tic2)/dble(rate), ' sec'
write(*,'(A,ES23.15,A)') 'Total : ', dble(toc2-tic1)/dble(rate), ' sec'
deallocate(array_2d)
! Case 3: 1D Array simulating row-major access
allocate(array_1d(data_len*sub_len))
call system_clock(tic1)
do i=1,data_len
   idx = i+data_len*(1-1)
   array_1d(idx) = data1(i)
   idx = i+data_len*(2-1)
   array_1d(idx) = data2(i)
   idx = i+data_len*(3-1)
   array_1d(idx) = data3(i)
   idx = i+data_len*(4-1)
   array_1d(idx) = data4(i)
   idx = i+data_len*(5-1)
   array_1d(idx) = data5(i)
enddo
call system_clock(toc1)
call system_clock(tic2)
do i=1,data_len
   temp(i) = 0.d0
   do j=1,5
      idx = i+data_len*(j-1)
      temp(i) = temp(i) + array_1d(idx)
   enddo
enddo
call system_clock(toc2)
write(*,'(A)') '--------------------------------------------------------------------------------'
write(*,'(A)') '# 1D Array (row-major sense)'
write(*,'(A,ES23.15)') 'Compiler optimization detour - ', sum(temp)
write(*,'(A,ES23.15,A)') 'Data fetching : ', dble(toc1-tic1)/dble(rate), ' sec'
write(*,'(A,ES23.15,A)') 'Summation : ', dble(toc2-tic2)/dble(rate), ' sec'
write(*,'(A,ES23.15,A)') 'Total : ', dble(toc2-tic1)/dble(rate), ' sec'
deallocate(array_1d)
! Case 4: 1D Array simulating column-major access
allocate(array_1d(data_len*sub_len))
call system_clock(tic1)
do i=1,data_len
   idx = sub_len*(i-1)+1
   array_1d(idx) = data1(i)
   idx = sub_len*(i-1)+2
   array_1d(idx) = data2(i)
   idx = sub_len*(i-1)+3
   array_1d(idx) = data3(i)
   idx = sub_len*(i-1)+4
   array_1d(idx) = data4(i)
   idx = sub_len*(i-1)+5
   array_1d(idx) = data5(i)
enddo
call system_clock(toc1)
call system_clock(tic2)
do i=1,data_len
   temp(i) = 0.d0
   do j=1,5
      idx = sub_len*(i-1)+j
      temp(i) = temp(i) + array_1d(idx)
   enddo
enddo
call system_clock(toc2)
write(*,'(A)') '--------------------------------------------------------------------------------'
write(*,'(A)') '# 1D Array (column-major sense)'
write(*,'(A,ES23.15)') 'Compiler optimization detour - ', sum(temp)
write(*,'(A,ES23.15,A)') 'Data fetching : ', dble(toc1-tic1)/dble(rate), ' sec'
write(*,'(A,ES23.15,A)') 'Summation : ', dble(toc2-tic2)/dble(rate), ' sec'
write(*,'(A,ES23.15,A)') 'Total : ', dble(toc2-tic1)/dble(rate), ' sec'
deallocate(array_1d)
end program test
To explain briefly: there are five one-dimensional arrays of fixed size 'data_len' that hold the data needed for the calculations (data1(:) through data5(:)).
The operation is performed after collecting the five data arrays into array_2d(:,:).
The intended shape of array_2d is array_2d(sub_len,data_len), where sub_len is greater than or equal to 5 to allow for scalability as the number of data sets grows.
In my situation sub_len << data_len, and both the data collection and the operation loop over the data sets (1:5) first, so I thought it would be ideal to make sub_len the first (row) dimension and data_len the second (column) dimension.
However, as sub_len increases (with the number of data sets fixed at 5), the results suggest the opposite: the row-major sense (array_2d(data_len,sub_len)) shows higher performance.
I suspected this might be a limitation of 2D arrays, so I also flattened everything into a 1D array (array_1d(sub_len*data_len)) and reproduced both access patterns there, but that gave the same result.
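For reference, the index arithmetic in Cases 3 and 4 below is just the standard column-major mapping between a 2D array and its flattened 1D storage. A tiny self-contained check of that mapping (using a small toy shape n1 x n2 of my own choosing, not the sizes above) would be:
program index_check
implicit none
integer, parameter :: n1 = 7, n2 = 3
double precision :: a2(n1,n2), a1(n1*n2)
integer :: i, j
call random_number(a2)
! reshape keeps the column-major element order, i.e. it flattens the storage as-is
a1 = reshape(a2, [n1*n2])
do j=1,n2
   do i=1,n1
      ! element (i,j) of an array whose first dimension is n1 lives at linear index i + n1*(j-1)
      if (a2(i,j) /= a1(i + n1*(j-1))) print *, 'mismatch at', i, j
   enddo
enddo
print *, 'array_2d(i,j) with first dimension n1 maps to array_1d(i + n1*(j-1))'
end program index_check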
Here are the results when data_len is fixed at 1,000,000 and sub_len is varied over 5, 10, 100, and 1,000:
# sub_len = 5
--------------------------------------------------------------------------------
# 2D Array (row-major sense)
Compiler optimization detour - 2.500099033103292E+06
Data fetching : 2.184600000000000E-02 sec
Summation : 6.276000000000000E-03 sec
Total : 2.812200000000000E-02 sec
--------------------------------------------------------------------------------
# 2D Array (column-major sense)
Compiler optimization detour - 2.500099033103292E+06
Data fetching : 1.804600000000000E-02 sec
Summation : 2.429000000000000E-03 sec
Total : 2.047500000000000E-02 sec
--------------------------------------------------------------------------------
# 1D Array (row-major sense)
Compiler optimization detour - 2.500099033103292E+06
Data fetching : 2.110700000000000E-02 sec
Summation : 3.313000000000000E-03 sec
Total : 2.442000000000000E-02 sec
--------------------------------------------------------------------------------
# 1D Array (column-major sense)
Compiler optimization detour - 2.500099033103292E+06
Data fetching : 1.812300000000000E-02 sec
Summation : 2.420000000000000E-03 sec
Total : 2.054300000000000E-02 sec
# sub_len = 10
--------------------------------------------------------------------------------
# 2D Array (row-major sense)
Compiler optimization detour - 2.500099033103292E+06
Data fetching : 2.166600000000000E-02 sec
Summation : 6.287000000000000E-03 sec
Total : 2.795400000000000E-02 sec
--------------------------------------------------------------------------------
# 2D Array (column-major sense)
Compiler optimization detour - 2.500099033103292E+06
Data fetching : 3.487700000000000E-02 sec
Summation : 3.780000000000000E-03 sec
Total : 3.865700000000000E-02 sec
--------------------------------------------------------------------------------
# 1D Array (row-major sense)
Compiler optimization detour - 2.500099033103292E+06
Data fetching : 2.131900000000000E-02 sec
Summation : 3.284000000000000E-03 sec
Total : 2.460300000000000E-02 sec
--------------------------------------------------------------------------------
# 1D Array (column-major sense)
Compiler optimization detour - 2.500099033103292E+06
Data fetching : 3.415100000000000E-02 sec
Summation : 3.813000000000000E-03 sec
Total : 3.796400000000000E-02 sec
# sub_len = 100
--------------------------------------------------------------------------------
# 2D Array (row-major sense)
Compiler optimization detour - 2.500099033103292E+06
Data fetching : 2.176100000000000E-02 sec
Summation : 6.263000000000000E-03 sec
Total : 2.802400000000000E-02 sec
--------------------------------------------------------------------------------
# 2D Array (column-major sense)
Compiler optimization detour - 2.500099033103292E+06
Data fetching : 3.237360000000000E-01 sec
Summation : 8.404999999999999E-03 sec
Total : 3.321410000000000E-01 sec
--------------------------------------------------------------------------------
# 1D Array (row-major sense)
Compiler optimization detour - 2.500099033103292E+06
Data fetching : 2.089700000000000E-02 sec
Summation : 3.283000000000000E-03 sec
Total : 2.418000000000000E-02 sec
--------------------------------------------------------------------------------
# 1D Array (column-major sense)
Compiler optimization detour - 2.500099033103292E+06
Data fetching : 3.255980000000000E-01 sec
Summation : 8.126000000000000E-03 sec
Total : 3.337240000000000E-01 sec
# sub_len = 1000
--------------------------------------------------------------------------------
# 2D Array (row-major sense)
Compiler optimization detour - 2.500099033103292E+06
Data fetching : 2.155700000000000E-02 sec
Summation : 6.220000000000000E-03 sec
Total : 2.777700000000000E-02 sec
--------------------------------------------------------------------------------
# 2D Array (column-major sense)
Compiler optimization detour - 2.500099033103292E+06
Data fetching : 1.634920000000000E+00 sec
Summation : 1.298100000000000E-02 sec
Total : 1.647901000000000E+00 sec
--------------------------------------------------------------------------------
# 1D Array (row-major sense)
Compiler optimization detour - 2.500099033103292E+06
Data fetching : 2.075800000000000E-02 sec
Summation : 3.214000000000000E-03 sec
Total : 2.397200000000000E-02 sec
--------------------------------------------------------------------------------
# 1D Array (column-major sense)
Compiler optimization detour - 2.500099033103292E+06
Data fetching : 1.645899000000000E+00 sec
Summation : 1.286800000000000E-02 sec
Total : 1.658767000000000E+00 sec
When the number of actual data sets equals sub_len (sub_len=5), the column-major version is faster, as expected. However, as sub_len increases and the array becomes 'sparse' (only 5 of the sub_len entries per column are actually used), the time for the column-major version increases significantly while the time for the row-major version stays the same.
This trend is the same regardless of the compiler type (Intel Fortran, gfortran) and the compiler optimization level (O0, O2, O3).
What could be the reason for this?
Edit 1) This trend is also observed when the number of data sets is set to 'sub_len' instead of being fixed at 5. In this case the summation part was faster in column-major regardless of sub_len, but the data fetching part was faster in row-major, as in the previous test. The slowdown of the data fetching part is so large that, in terms of total time, row-major always wins when sub_len is large.
Edit 2) I found that calling 'allocate' does not actually map the entire allocation into physical memory immediately. To account for this, I initialized array_2d = 1.d0 right after the allocate call (before the time measurement). With this change, when sub_len=1000, the result is:
# sub_len = 1000
--------------------------------------------------------------------------------
# 2D Array (row-major sense)
Compiler optimization detour - 2.500099033103292E+06
Data fetching : 3.348000000000000E-03 sec
Summation : 5.835000000000000E-03 sec
Total : 9.183000000000000E-03 sec
--------------------------------------------------------------------------------
# 2D Array (column-major sense)
Compiler optimization detour - 2.500099033103292E+06
Data fetching : 1.675400000000000E-02 sec
Summation : 1.279400000000000E-02 sec
Total : 2.954800000000000E-02 sec
--------------------------------------------------------------------------------
# 1D Array (row-major sense)
Compiler optimization detour - 2.500099033103292E+06
Data fetching : 3.185000000000000E-03 sec
Summation : 3.084000000000000E-03 sec
Total : 6.269000000000000E-03 sec
--------------------------------------------------------------------------------
# 1D Array (column-major sense)
Compiler optimization detour - 2.500099033103292E+06
Data fetching : 1.671800000000000E-02 sec
Summation : 1.249400000000000E-02 sec
Total : 2.921200000000000E-02 sec
Compared to the previous measurement, the speed of column-major has definitely improved, but it is still slower than row-major.
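For clarity, the Edit 2 change only moves the first touch of the array out of the timed region. A minimal self-contained sketch of the idea for the Case 2 shape (the fill loop here is just a stand-in for the real data-fetching loop, and the allocation is ~8 GB, as in the question) would be:
program first_touch_sketch
implicit none
integer :: data_len, sub_len, i
double precision, allocatable :: array_2d(:,:)
integer(kind=8) :: tic, toc, rate
call system_clock(count_rate=rate)
data_len = 1000000
sub_len = 1000
allocate(array_2d(sub_len,data_len))
! first touch: the OS maps the pages here, not inside the timed loop below
array_2d = 1.d0
call system_clock(tic)
do i=1,data_len
   ! stand-in for copying data1..data5 into column i
   array_2d(1:5,i) = dble(i)
enddo
call system_clock(toc)
write(*,'(A,ES23.15,A)') 'Data fetching : ', dble(toc-tic)/dble(rate), ' sec'
end program first_touch_sketch
Without the array_2d = 1.d0 line, the page faults triggered by the first write to each fresh page are counted as part of 'Data fetching', which is what Edit 2 corrects for.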
Edit 3) The compiler versions and options were:
Intel
Version : ifort (IFORT) 2021.12.0 20240222
Compile options : only -O0, -O2, or -O3 (e.g., ifort -O3 test.f90)
GNU
Version : GNU Fortran (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Compile options : only -O0, -O2, or -O3 (e.g., gfortran -O3 test.f90)
My system information is:
OS : Ubuntu 22.04.4 LTS
CPU : AMD Ryzen Threadripper 3970X 32-Core Processor
RAM : 128GB
Edit 4) Supplementary results for 'Edit 1'
code
program test
implicit none
integer :: data_len, sub_len
double precision, allocatable :: array_1d(:), array_2d(:,:)
double precision, allocatable :: temp(:)
type :: data_set_st
   double precision, allocatable :: data(:)
end type data_set_st
type(data_set_st), allocatable :: data_set(:)
integer :: i, j, idx
integer(kind=8) :: tic1, toc1, tic2, toc2, rate
call system_clock(count_rate=rate)
data_len = 1000000
sub_len = 1000
allocate(data_set(sub_len))
do i=1,sub_len
   allocate(data_set(i)%data(data_len))
   call random_number(data_set(i)%data)
enddo
allocate(temp(data_len))
! Case 1: 2D Array accessed in row-major sense
allocate(array_2d(data_len,sub_len))
array_2d=1.d0
call system_clock(tic1)
do i=1,data_len
   do j=1,sub_len
      array_2d(i,j) = data_set(j)%data(i)
   enddo
enddo
call system_clock(toc1)
call system_clock(tic2)
do i=1,data_len
   temp(i) = 0.d0
   do j=1,sub_len
      temp(i) = temp(i) + array_2d(i,j)
   enddo
enddo
call system_clock(toc2)
write(*,'(A)') '--------------------------------------------------------------------------------'
write(*,'(A)') '# 2D Array (row-major sense)'
write(*,'(A,ES23.15)') 'Compiler optimization detour - ', sum(temp)
write(*,'(A,ES23.15,A)') 'Data fetching : ', dble(toc1-tic1)/dble(rate), ' sec'
write(*,'(A,ES23.15,A)') 'Summation : ', dble(toc2-tic2)/dble(rate), ' sec'
write(*,'(A,ES23.15,A)') 'Total : ', dble(toc2-tic1)/dble(rate), ' sec'
deallocate(array_2d)
! Case 2: 2D Array accessed in column-major sense
allocate(array_2d(sub_len,data_len))
array_2d=1.d0
call system_clock(tic1)
do i=1,data_len
   do j=1,sub_len
      array_2d(j,i) = data_set(j)%data(i)
   enddo
enddo
call system_clock(toc1)
call system_clock(tic2)
do i=1,data_len
   temp(i) = 0.d0
   do j=1,sub_len
      temp(i) = temp(i) + array_2d(j,i)
   enddo
enddo
call system_clock(toc2)
write(*,'(A)') '--------------------------------------------------------------------------------'
write(*,'(A)') '# 2D Array (column-major sense)'
write(*,'(A,ES23.15)') 'Compiler optimization detour - ', sum(temp)
write(*,'(A,ES23.15,A)') 'Data fetching : ', dble(toc1-tic1)/dble(rate), ' sec'
write(*,'(A,ES23.15,A)') 'Summation : ', dble(toc2-tic2)/dble(rate), ' sec'
write(*,'(A,ES23.15,A)') 'Total : ', dble(toc2-tic1)/dble(rate), ' sec'
deallocate(array_2d)
end program test
result
# sub_len = 1000
--------------------------------------------------------------------------------
# 2D Array (row-major sense)
Compiler optimization detour - 4.999942139593881E+08
Data fetching : 4.374860000000000E-01 sec
Summation : 4.080160000000000E-01 sec
Total : 8.455020000000000E-01 sec
--------------------------------------------------------------------------------
# 2D Array (column-major sense)
Compiler optimization detour - 4.999942139593881E+08
Data fetching : 8.858646000000000E+00 sec
Summation : 3.595860000000000E-01 sec
Total : 9.218232000000000E+00 sec
Upvotes: 10
Views: 419
Reputation: 2688
Your code does almost nothing in terms of computation, so it is dominated by memory accesses. In case 1 you have 5 "active" cache lines in the L1 cache memory (the fastest one, closest to the core) and they are fully used. In case 2 you have only 1 active cache line in the L1 cache memory, and it is not fully used.
A cache line is generally 64 bytes, hence it can hold 8 double precision values.
Let's see what happens in case 1. The active cache lines are, successively:
array(1:8,1), array(1:8,2), array(1:8,3), array(1:8,4), array(1:8,5)
array(9:16,1), array(9:16,2), array(9:16,3), array(9:16,4), array(9:16,5)
and so on.
Now for case 2:
array(1:8,1); only array(1:5,1) are used
array(1:8,2); only array(1:5,2) are used
and so on.
So, case 1 makes better use of the cache memory overall. Moreover, when multiple cache lines are involved, I guess that the hardware is able to perform computations using a loaded cache line at the same time it loads other cache lines.
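As a rough back-of-the-envelope illustration of this (my own accounting, assuming 64-byte cache lines, 8-byte reals, and columns that start on cache-line boundaries, and ignoring hardware prefetching, which further favors the long contiguous streams of case 1), one can count how many cache lines each summation loop has to bring in for sub_len = 1000:
program cache_line_accounting
implicit none
integer(kind=8), parameter :: data_len = 1000000, nset = 5
integer(kind=8), parameter :: elem_bytes = 8, line_bytes = 64
integer(kind=8) :: lines_case1, lines_case2
! Case 1: array_2d(data_len,sub_len). The loop streams through nset full,
! contiguous columns of data_len elements, so every byte of every loaded line is eventually used.
lines_case1 = nset * ((data_len*elem_bytes + line_bytes - 1) / line_bytes)
! Case 2: array_2d(sub_len,data_len). For each i only the first nset elements
! (40 bytes) of an 8000-byte column are read, so at least one 64-byte line is
! loaded per i and most of the surrounding data is skipped.
lines_case2 = data_len * ((nset*elem_bytes + line_bytes - 1) / line_bytes)
print '(A,I0)', 'Case 1 cache lines loaded ~ ', lines_case1   ! ~625000
print '(A,I0)', 'Case 2 cache lines loaded ~ ', lines_case2   ! ~1000000
end program cache_line_accounting
On top of loading roughly 1.6 times as many lines, case 2 jumps sub_len*8 = 8000 bytes between consecutive values of i, so the accesses are strided rather than streaming.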
Upvotes: 10