Swari du

Reputation: 101

Unexpected Slower Performance with Column-Major Access in Fortran Arrays

I'm working on optimizing array access patterns in Fortran.

I allocated and accessed the array keeping in mind that Fortran stores multi-dimensional arrays in column-major order.

However, I'm observing that accessing arrays in a way that should be column-major is actually slower than accessing them in a row-major sense. This is counter-intuitive, and I'm trying to understand why this is happening.

Here's the code I'm using to mimic my situation:

program test
  implicit none
  integer :: data_len, sub_len
  double precision, allocatable :: array_1d(:), array_2d(:,:)
  double precision, allocatable :: data1(:), data2(:), data3(:), data4(:), data5(:), temp(:)

  integer :: i, j, idx
  integer(kind=8) :: tic1, toc1, tic2, toc2, rate

  call system_clock(count_rate=rate)

  data_len = 1000000
  sub_len = 1000

  allocate(data1(data_len))
  allocate(data2(data_len))
  allocate(data3(data_len))
  allocate(data4(data_len))
  allocate(data5(data_len))
  allocate(temp(data_len))

  call random_number(data1)
  call random_number(data2)
  call random_number(data3)
  call random_number(data4)
  call random_number(data5)

  ! Case 1: 2D Array accessed in row-major sense
  allocate(array_2d(data_len,sub_len))
  call system_clock(tic1)
  do i=1,data_len
    array_2d(i,1) = data1(i)
    array_2d(i,2) = data2(i)
    array_2d(i,3) = data3(i)
    array_2d(i,4) = data4(i)
    array_2d(i,5) = data5(i)
  enddo
  call system_clock(toc1)
  call system_clock(tic2)
  do i=1,data_len
    temp(i) = 0.d0
    do j=1,5
      temp(i) = temp(i) + array_2d(i,j)
    enddo
  enddo
  call system_clock(toc2)
  write(*,'(A)')           '--------------------------------------------------------------------------------'
  write(*,'(A)')           '# 2D Array (row-major sense)'
  write(*,'(A,ES23.15)')   'Compiler optimization detour - ', sum(temp)
  write(*,'(A,ES23.15,A)') 'Data fetching : ', dble(toc1-tic1)/dble(rate), ' sec'
  write(*,'(A,ES23.15,A)') 'Summation     : ', dble(toc2-tic2)/dble(rate), ' sec'
  write(*,'(A,ES23.15,A)') 'Total         : ', dble(toc2-tic1)/dble(rate), ' sec'
  deallocate(array_2d)

  ! Case 2: 2D Array accessed in column-major sense
  allocate(array_2d(sub_len,data_len))
  call system_clock(tic1)
  do i=1,data_len
    array_2d(1,i) = data1(i)
    array_2d(2,i) = data2(i)
    array_2d(3,i) = data3(i)
    array_2d(4,i) = data4(i)
    array_2d(5,i) = data5(i)
  enddo
  call system_clock(toc1)
  call system_clock(tic2)
  do i=1,data_len
    temp(i) = 0.d0
    do j=1,5
      temp(i) = temp(i) + array_2d(j,i)
    enddo
  enddo
  call system_clock(toc2)
  write(*,'(A)')           '--------------------------------------------------------------------------------'
  write(*,'(A)')           '# 2D Array (column-major sense)'
  write(*,'(A,ES23.15)')   'Compiler optimization detour - ', sum(temp)
  write(*,'(A,ES23.15,A)') 'Data fetching : ', dble(toc1-tic1)/dble(rate), ' sec'
  write(*,'(A,ES23.15,A)') 'Summation     : ', dble(toc2-tic2)/dble(rate), ' sec'
  write(*,'(A,ES23.15,A)') 'Total         : ', dble(toc2-tic1)/dble(rate), ' sec'
  deallocate(array_2d)

  ! Case 3: 1D Array simulating row-major access
  allocate(array_1d(data_len*sub_len))
  call system_clock(tic1)
  do i=1,data_len
    idx = i+data_len*(1-1)
    array_1d(idx) = data1(i)
    idx = i+data_len*(2-1)
    array_1d(idx) = data2(i)
    idx = i+data_len*(3-1)
    array_1d(idx) = data3(i)
    idx = i+data_len*(4-1)
    array_1d(idx) = data4(i)
    idx = i+data_len*(5-1)
    array_1d(idx) = data5(i)
  enddo
  call system_clock(toc1)
  call system_clock(tic2)
  do i=1,data_len
    temp(i) = 0.d0
    do j=1,5
      idx = i+data_len*(j-1)
      temp(i) = temp(i) + array_1d(idx)
    enddo
  enddo
  call system_clock(toc2)
  write(*,'(A)')           '--------------------------------------------------------------------------------'
  write(*,'(A)')           '# 1D Array (row-major sense)'
  write(*,'(A,ES23.15)')   'Compiler optimization detour - ', sum(temp)
  write(*,'(A,ES23.15,A)') 'Data fetching : ', dble(toc1-tic1)/dble(rate), ' sec'
  write(*,'(A,ES23.15,A)') 'Summation     : ', dble(toc2-tic2)/dble(rate), ' sec'
  write(*,'(A,ES23.15,A)') 'Total         : ', dble(toc2-tic1)/dble(rate), ' sec'
  deallocate(array_1d)

  ! Case 4: 1D Array simulating column-major access
  allocate(array_1d(data_len*sub_len))
  call system_clock(tic1)
  do i=1,data_len
    idx = sub_len*(i-1)+1
    array_1d(idx) = data1(i)
    idx = sub_len*(i-1)+2
    array_1d(idx) = data2(i)
    idx = sub_len*(i-1)+3
    array_1d(idx) = data3(i)
    idx = sub_len*(i-1)+4
    array_1d(idx) = data4(i)
    idx = sub_len*(i-1)+5
    array_1d(idx) = data5(i)
  enddo
  call system_clock(toc1)
  call system_clock(tic2)
  do i=1,data_len
    temp(i) = 0.d0
    do j=1,5
      idx = sub_len*(i-1)+j
      temp(i) = temp(i) + array_1d(idx)
    enddo
  enddo
  call system_clock(toc2)
  write(*,'(A)')           '--------------------------------------------------------------------------------'
  write(*,'(A)')           '# 1D Array (column-major sense)'
  write(*,'(A,ES23.15)')   'Compiler optimization detour - ', sum(temp)
  write(*,'(A,ES23.15,A)') 'Data fetching : ', dble(toc1-tic1)/dble(rate), ' sec'
  write(*,'(A,ES23.15,A)') 'Summation     : ', dble(toc2-tic2)/dble(rate), ' sec'
  write(*,'(A,ES23.15,A)') 'Total         : ', dble(toc2-tic1)/dble(rate), ' sec'
  deallocate(array_1d)

end program test

To explain briefly: there are five one-dimensional arrays of fixed size 'data_len' that hold the data needed for the calculations (data1(:) through data5(:)).

The operation is performed after gathering the five data sets into array_2d(:,:).

array_2d is sized as array_2d(sub_len,data_len), where sub_len is greater than or equal to 5 to allow for scaling as the number of data sets grows.

In my situation sub_len << data_len, and both the data gathering and the operation run over the data-set index (1:5) in the innermost loop, so I thought it would be ideal to put sub_len in the first (row) dimension and data_len in the second (column) dimension.

However, as sub_len increases (with the number of data sets fixed at 5), the results suggest the opposite: the row-major sense (array_2d(data_len,sub_len)) shows higher performance.

I suspected this might be a limitation of 2D arrays, so I also flattened everything into a 1D array (array_1d(sub_len*data_len)) and accessed it with each pattern, but that produced the same result.
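For reference, the flattened indexing mirrors Fortran's own column-major layout; a minimal check of the formulas (with tiny sizes, just for illustration) is:

program index_check
  implicit none
  integer, parameter :: data_len = 7, sub_len = 3   ! tiny sizes, only for the check
  double precision :: a(sub_len,data_len), flat(sub_len*data_len)
  integer :: i, j
  call random_number(a)
  flat = reshape(a, [sub_len*data_len])             ! flat holds a's elements in memory order
  do i=1,data_len
    do j=1,sub_len
      ! a(j,i) sits at linear position j + sub_len*(i-1) (the Case 4 formula);
      ! with the dimensions swapped, a(i,j) sits at i + data_len*(j-1) (the Case 3 formula)
      if (flat(j+sub_len*(i-1)) /= a(j,i)) stop 'index mismatch'
    enddo
  enddo
  write(*,'(A)') 'Flattened indexing matches the 2D column-major layout'
end program index_check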

Here are the results when data_len is fixed at 1,000,000 and sub_len is varied over 5, 10, 100, and 1,000:

# sub_len = 5
--------------------------------------------------------------------------------
# 2D Array (row-major sense)
Compiler optimization detour -   2.500099033103292E+06
Data fetching :   2.184600000000000E-02 sec
Summation     :   6.276000000000000E-03 sec
Total         :   2.812200000000000E-02 sec
--------------------------------------------------------------------------------
# 2D Array (column-major sense)
Compiler optimization detour -   2.500099033103292E+06
Data fetching :   1.804600000000000E-02 sec
Summation     :   2.429000000000000E-03 sec
Total         :   2.047500000000000E-02 sec
--------------------------------------------------------------------------------
# 1D Array (row-major sense)
Compiler optimization detour -   2.500099033103292E+06
Data fetching :   2.110700000000000E-02 sec
Summation     :   3.313000000000000E-03 sec
Total         :   2.442000000000000E-02 sec
--------------------------------------------------------------------------------
# 1D Array (column-major sense)
Compiler optimization detour -   2.500099033103292E+06
Data fetching :   1.812300000000000E-02 sec
Summation     :   2.420000000000000E-03 sec
Total         :   2.054300000000000E-02 sec
# sub_len = 10
--------------------------------------------------------------------------------
# 2D Array (row-major sense)
Compiler optimization detour -   2.500099033103292E+06
Data fetching :   2.166600000000000E-02 sec
Summation     :   6.287000000000000E-03 sec
Total         :   2.795400000000000E-02 sec
--------------------------------------------------------------------------------
# 2D Array (column-major sense)
Compiler optimization detour -   2.500099033103292E+06
Data fetching :   3.487700000000000E-02 sec
Summation     :   3.780000000000000E-03 sec
Total         :   3.865700000000000E-02 sec
--------------------------------------------------------------------------------
# 1D Array (row-major sense)
Compiler optimization detour -   2.500099033103292E+06
Data fetching :   2.131900000000000E-02 sec
Summation     :   3.284000000000000E-03 sec
Total         :   2.460300000000000E-02 sec
--------------------------------------------------------------------------------
# 1D Array (column-major sense)
Compiler optimization detour -   2.500099033103292E+06
Data fetching :   3.415100000000000E-02 sec
Summation     :   3.813000000000000E-03 sec
Total         :   3.796400000000000E-02 sec
# sub_len = 100
--------------------------------------------------------------------------------
# 2D Array (row-major sense)
Compiler optimization detour -   2.500099033103292E+06
Data fetching :   2.176100000000000E-02 sec
Summation     :   6.263000000000000E-03 sec
Total         :   2.802400000000000E-02 sec
--------------------------------------------------------------------------------
# 2D Array (column-major sense)
Compiler optimization detour -   2.500099033103292E+06
Data fetching :   3.237360000000000E-01 sec
Summation     :   8.404999999999999E-03 sec
Total         :   3.321410000000000E-01 sec
--------------------------------------------------------------------------------
# 1D Array (row-major sense)
Compiler optimization detour -   2.500099033103292E+06
Data fetching :   2.089700000000000E-02 sec
Summation     :   3.283000000000000E-03 sec
Total         :   2.418000000000000E-02 sec
--------------------------------------------------------------------------------
# 1D Array (column-major sense)
Compiler optimization detour -   2.500099033103292E+06
Data fetching :   3.255980000000000E-01 sec
Summation     :   8.126000000000000E-03 sec
Total         :   3.337240000000000E-01 sec
# sub_len = 1000
--------------------------------------------------------------------------------
# 2D Array (row-major sense)
Compiler optimization detour -   2.500099033103292E+06
Data fetching :   2.155700000000000E-02 sec
Summation     :   6.220000000000000E-03 sec
Total         :   2.777700000000000E-02 sec
--------------------------------------------------------------------------------
# 2D Array (column-major sense)
Compiler optimization detour -   2.500099033103292E+06
Data fetching :   1.634920000000000E+00 sec
Summation     :   1.298100000000000E-02 sec
Total         :   1.647901000000000E+00 sec
--------------------------------------------------------------------------------
# 1D Array (row-major sense)
Compiler optimization detour -   2.500099033103292E+06
Data fetching :   2.075800000000000E-02 sec
Summation     :   3.214000000000000E-03 sec
Total         :   2.397200000000000E-02 sec
--------------------------------------------------------------------------------
# 1D Array (column-major sense)
Compiler optimization detour -   2.500099033103292E+06
Data fetching :   1.645899000000000E+00 sec
Summation     :   1.286800000000000E-02 sec
Total         :   1.658767000000000E+00 sec

When the number of actual data sets equals sub_len (sub_len=5), column-major is faster, as expected. However, as sub_len increases and the array becomes 'sparse', the time for column-major grows significantly while the time for row-major stays the same.

This trend is the same regardless of the compiler (Intel Fortran, gfortran) and the optimization level (-O0, -O2, -O3).

What could be the reason for this?

Edit 1) This trend is also observed when the number of data sets is set to 'sub_len' instead of being fixed at 5 (see Edit 4 below for the code and results). In that case the summation part was faster in column-major regardless of sub_len, but the data fetching part was faster in row-major, as in the previous test. The slowdown in data fetching is so large that row-major always wins in total time when sub_len is large.

Edit 2) I found that calling 'allocate' does not necessarily map the entire allocation into physical memory right away. To account for this, I initialized array_2d = 1.d0 right after the allocate call (before the time measurement), as sketched below.
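The only new line relative to the original code is the initialization (shown here for Case 1; Case 2 was changed the same way):

  ! Case 1: 2D Array accessed in row-major sense
  allocate(array_2d(data_len,sub_len))
  array_2d = 1.d0   ! touch the whole array so the pages are actually mapped before timing
  call system_clock(tic1)

With this change, when sub_len=1000, the result is: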

# sub_len = 1000
--------------------------------------------------------------------------------
# 2D Array (row-major sense)
Compiler optimization detour -   2.500099033103292E+06
Data fetching :   3.348000000000000E-03 sec
Summation     :   5.835000000000000E-03 sec
Total         :   9.183000000000000E-03 sec
--------------------------------------------------------------------------------
# 2D Array (column-major sense)
Compiler optimization detour -   2.500099033103292E+06
Data fetching :   1.675400000000000E-02 sec
Summation     :   1.279400000000000E-02 sec
Total         :   2.954800000000000E-02 sec
--------------------------------------------------------------------------------
# 1D Array (row-major sense)
Compiler optimization detour -   2.500099033103292E+06
Data fetching :   3.185000000000000E-03 sec
Summation     :   3.084000000000000E-03 sec
Total         :   6.269000000000000E-03 sec
--------------------------------------------------------------------------------
# 1D Array (column-major sense)
Compiler optimization detour -   2.500099033103292E+06
Data fetching :   1.671800000000000E-02 sec
Summation     :   1.249400000000000E-02 sec
Total         :   2.921200000000000E-02 sec

Compared to the previous measurement, the speed of column-major has definitely improved, but it is still slower than row-major.

Edit 3) The compiler versions and options were:

Intel

Version : ifort (IFORT) 2021.12.0 20240222

Compile options : only -O0, -O2, or -O3 (e.g. ifort -O3 test.f90)

GNU

Version : GNU Fortran (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

Compile options : only -O0, -O2, or -O3 (e.g. gfortran -O3 test.f90)

My system information is:

OS : Ubuntu 22.04.4 LTS

CPU : AMD Ryzen Threadripper 3970X 32-Core Processor

RAM : 128GB

Edit 4) Supplementary results for 'Edit 1'

code

program test
  implicit none
  integer :: data_len, sub_len
  double precision, allocatable :: array_1d(:), array_2d(:,:)
  double precision, allocatable :: temp(:)
  type :: data_set_st
    double precision, allocatable :: data(:)
  end type data_set_st
  type(data_set_st), allocatable :: data_set(:)

  integer :: i, j, idx
  integer(kind=8) :: tic1, toc1, tic2, toc2, rate

  call system_clock(count_rate=rate)

  data_len = 1000000
  sub_len = 1000

  allocate(data_set(sub_len))
  do i=1,sub_len
    allocate(data_set(i)%data(data_len))
    call random_number(data_set(i)%data)
  enddo
  allocate(temp(data_len))

  ! Case 1: 2D Array accessed in row-major sense
  allocate(array_2d(data_len,sub_len))
  array_2d=1.d0
  call system_clock(tic1)
  do i=1,data_len
    do j=1,sub_len
      array_2d(i,j) = data_set(j)%data(i)
    enddo
  enddo
  call system_clock(toc1)
  call system_clock(tic2)
  do i=1,data_len
    temp(i) = 0.d0
    do j=1,sub_len
      temp(i) = temp(i) + array_2d(i,j)
    enddo
  enddo
  call system_clock(toc2)
  write(*,'(A)')           '--------------------------------------------------------------------------------'
  write(*,'(A)')           '# 2D Array (row-major sense)'
  write(*,'(A,ES23.15)')   'Compiler optimization detour - ', sum(temp)
  write(*,'(A,ES23.15,A)') 'Data fetching : ', dble(toc1-tic1)/dble(rate), ' sec'
  write(*,'(A,ES23.15,A)') 'Summation     : ', dble(toc2-tic2)/dble(rate), ' sec'
  write(*,'(A,ES23.15,A)') 'Total         : ', dble(toc2-tic1)/dble(rate), ' sec'
  deallocate(array_2d)

  ! Case 2: 2D Array accessed in column-major sense
  allocate(array_2d(sub_len,data_len))
  array_2d=1.d0
  call system_clock(tic1)
  do i=1,data_len
    do j=1,sub_len
      array_2d(j,i) = data_set(j)%data(i)
    enddo
  enddo
  call system_clock(toc1)
  call system_clock(tic2)
  do i=1,data_len
    temp(i) = 0.d0
    do j=1,sub_len
      temp(i) = temp(i) + array_2d(j,i)
    enddo
  enddo
  call system_clock(toc2)
  write(*,'(A)')           '--------------------------------------------------------------------------------'
  write(*,'(A)')           '# 2D Array (column-major sense)'
  write(*,'(A,ES23.15)')   'Compiler optimization detour - ', sum(temp)
  write(*,'(A,ES23.15,A)') 'Data fetching : ', dble(toc1-tic1)/dble(rate), ' sec'
  write(*,'(A,ES23.15,A)') 'Summation     : ', dble(toc2-tic2)/dble(rate), ' sec'
  write(*,'(A,ES23.15,A)') 'Total         : ', dble(toc2-tic1)/dble(rate), ' sec'
  deallocate(array_2d)

end program test

result

# sub_len = 1000
--------------------------------------------------------------------------------
# 2D Array (row-major sense)
Compiler optimization detour -   4.999942139593881E+08
Data fetching :   4.374860000000000E-01 sec
Summation     :   4.080160000000000E-01 sec
Total         :   8.455020000000000E-01 sec
--------------------------------------------------------------------------------
# 2D Array (column-major sense)
Compiler optimization detour -   4.999942139593881E+08
Data fetching :   8.858646000000000E+00 sec
Summation     :   3.595860000000000E-01 sec
Total         :   9.218232000000000E+00 sec

Upvotes: 10

Views: 419

Answers (1)

PierU

Reputation: 2688

Your code does almost nothing in terms of computation, so it is dominated by memory accesses. In case 1 you have 5 "active" cache lines in the L1 cache (the fastest level, closest to the core), and they are fully used. In case 2 you have only 1 active cache line in the L1 cache, and it is not fully used.

A cache line is generally 64 bytes, hence it can hold 8 double precision reals.
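As a quick check (a minimal sketch, assuming the usual 64-byte line), the capacity follows directly from storage_size():

program line_capacity
  implicit none
  integer, parameter :: cache_line_bytes = 64   ! assumed typical cache line size
  ! storage_size() returns the size in bits, so divide by 8 to get bytes per element
  write(*,'(I0,A)') cache_line_bytes/(storage_size(0.d0)/8), &
      ' double precision values fit in one cache line'
end program line_capacity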

Let's see what happens in case 1:

  • i=1
    • 5 cache lines are loaded: array(1:8,1), array(1:8,2), array(1:8,3), array(1:8,4), array(1:8,5)
  • i=2..8
    • the needed data is already in cache, and all the data in cache are used
  • i=9
    • 5 new cache lines are loaded: array(9:16,1), array(9:16,2), array(9:16,3), array(9:16,4), array(9:16,5)
  • i=10..16
    • the needed data is already in cache, and all the data in cache are used
  • ...

Now for case 2:

  • i=1
    • 1 cache line is loaded: array(1:8,1); only array(1:5,1) are used
  • i=2
    • 1 new cache line is loaded: array(1:8,2); only array(1:5,2) are used
  • ...

So, case 1 makes better use of the cache memory overall. Moreover, when multiple cache lines are involved, I suspect the hardware can perform computations on an already-loaded cache line while it fetches the next ones.
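One way to see this in practice (a sketch built on the question's Case 2, assuming only the 5 data sets are ever needed): size the leading dimension to 5 instead of sub_len, so that consecutive columns are packed back to back and every loaded cache line is fully used.

  ! Sketch: same column-major fill as Case 2, but with a tight leading dimension of 5,
  ! so array_2d(1:5,i) and array_2d(1:5,i+1) share cache lines with no wasted slots
  allocate(array_2d(5,data_len))
  do i=1,data_len
    array_2d(1,i) = data1(i)
    array_2d(2,i) = data2(i)
    array_2d(3,i) = data3(i)
    array_2d(4,i) = data4(i)
    array_2d(5,i) = data5(i)
  enddo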

Upvotes: 10
