Nick Brady

Reputation: 151

Why is element-wise matrix row-exchanging more efficient than array-wise row-exchanging in Fortran?

I have some Fortran code that performs a matrix row exchange. In the legacy code, it is written as

do J = 1,N
   save_val  = B(IROW,J)
   B(IROW,J) = B(JCOL,J)
   B(JCOL,J) = save_val
end do

This exchanges row IROW with row JCOL (IROW and JCOL are integers). However, the purpose of that block of code is not obvious at a glance. In my opinion it would be more intuitive, or at least more readable, to write it as:

save_row  = B(IROW,:)
B(IROW,:) = B(JCOL,:)
B(JCOL,:) = save_row

(it is clearer that rows are being moved).

From the attached image, it is clear that the looping methods perform better than the array operations. Why is this? Is it because this becomes a memory-limited process when the number of elements in the array gets large (i.e., the array gets "chunked")? Or is it something else?

Compiling with gfortran -O3 test.f95. Adding the flag -fstack-arrays did not make a significant difference.

[Image: Timings of different Fortran executions of matrix row exchanges]

program test

  implicit none

  integer  :: N
  integer  :: M
  integer  :: loop_max = 10000000  ! number of timing repetitions (1e7)
  integer  :: i                    ! loop index
  real     :: t1, t2
  real     :: time_loop, time_array, time_sub_loop, time_sub_array

  real, dimension(:, :), allocatable   :: B
  real, dimension(:)   , allocatable   :: save_row

  real :: save_val
  integer :: IROW, J, JCOL

  character(*), parameter :: format_header = '(A5, 1X, 4(A12,1X))'
  character(*), parameter :: format_data = '(I5, 1X, 4(ES12.5, 1X))'


  open(1, file = 'TimingRowExchange.txt', status = 'unknown')
  write(1, format_header) 'N', 't_loop', 't_array', 't_sub_loop', 't_sub_array'

  do N = 1, 100
    M = N + 1
    allocate(B(N,N), save_row(M))
    call random_number(B)

    JCOL = 1
    IROW = 3

    call CPU_time(t1)
    do i = 1, loop_max
      do J = 1,N
        save_val  = B(IROW,J)
        B(IROW,J) = B(JCOL,J)
        B(JCOL,J) = save_val
      end do
    end do
    call CPU_time(t2)
    time_loop = t2 - t1
    ! write ( *, * ) 'Using Loop =', t2 - t1


    call CPU_time(t1)
    do i = 1, loop_max
        save_row(1:N) = B(IROW,:)
        B(IROW,:)     = B(JCOL,:)
        B(JCOL,:)     = save_row(1:N)
    end do
    call CPU_time(t2)
    time_array = t2 - t1
    ! write ( *, * ) 'Using Array =', t2 - t1

    call CPU_time(t1)
    do i = 1, loop_max
      call exchange_rows_loop(B, JCOL, IROW)
    end do
    call CPU_time(t2)
    time_sub_loop = t2 - t1
    ! write ( *, * ) 'Loop Subrout =', t2 - t1


    call CPU_time(t1)
    do i = 1, loop_max
      call exchange_rows_array(B, JCOL, IROW)
    end do
    call CPU_time(t2)
    time_sub_array = t2 - t1
    ! write ( *, * ) 'Array Subrout =', t2 - t1

    deallocate(B, save_row)
    write(1, format_data) N, time_loop, time_array, time_sub_loop, time_sub_array
  end do


contains


  subroutine print_mat(A)
    implicit none
    real, dimension(:,:), intent(in) :: A
    integer :: i, n   ! local i, so the host's loop index is not clobbered

    n = size(A,1) ! # of rows

    do i = 1,n
      print *, A(i,:)
    end do
    print *

  end subroutine print_mat



  subroutine exchange_rows_loop(A, row1, row2)
    implicit none
    real, dimension(:,:), intent(in out) :: A
    integer,              intent(in)     :: row1, row2

    integer :: J
    real :: save_val

    do J = 1, size(A,2)   ! loop over the columns of A
      save_val  = A(row1,J)
      A(row1,J) = A(row2,J)
      A(row2,J) = save_val
    end do

  end subroutine exchange_rows_loop



  subroutine exchange_rows_array(A, row1, row2)
    implicit none
    real, dimension(:,:), intent(in out) :: A
    integer,              intent(in)     :: row1, row2

    real, dimension(size(A,2))           :: save_row   ! holds one full row of A

    save_row  = A(row1,:)
    A(row1,:) = A(row2,:)
    A(row2,:) = save_row

  end subroutine exchange_rows_array


end program test

Upvotes: 3

Views: 489

Answers (2)

Those two versions do not do the same thing: the ordering of the operations is different, and the optimizations done by the compiler are different. The assembly is available at https://godbolt.org/z/hc8zcs
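Spelled out as explicit loops, the array-assignment version amounts to three separate passes over the row rather than one fused pass (a sketch of the semantics, using the question's variable names):

```fortran
! sketch: what the three array assignments amount to
do J = 1, N
  save_row(J) = B(IROW,J)   ! save_row  = B(IROW,:)
end do
do J = 1, N
  B(IROW,J) = B(JCOL,J)     ! B(IROW,:) = B(JCOL,:)
end do
do J = 1, N
  B(JCOL,J) = save_row(J)   ! B(JCOL,:) = save_row
end do
```

So the elements are visited in a different order than in the single loop, which loads and stores both rows in one pass.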

The loop:

    movss   xmm0, DWORD PTR [rax+12]
    movss   xmm1, DWORD PTR [rax+4]
    movss   DWORD PTR [rax+4], xmm0
    movss   DWORD PTR [rax+12], xmm1
    cmp     ebx, 1
    je      .L6
    movss   xmm0, DWORD PTR [rsi]
    movss   xmm1, DWORD PTR [rcx]
    movss   DWORD PTR [rsi], xmm1
    movss   DWORD PTR [rcx], xmm0
    cmp     ebx, 2
    je      .L6
    movss   xmm0, DWORD PTR [r8]
    movss   xmm1, DWORD PTR [rdi]
    movss   DWORD PTR [r8], xmm1
    movss   DWORD PTR [rdi], xmm0

It is somewhat unrolled and SSE instructions are used (movss, a scalar single-precision move). Overall it is very straightforward.

The subarray version:

        mov     r10, QWORD PTR [rsp+216]
        mov     rsi, QWORD PTR [rsp+208]
        mov     r8d, 10000000
        mov     rdx, QWORD PTR [rsp+152]
        mov     rdi, QWORD PTR [rsp+144]
        mov     r11, r10
        lea     rbx, [r10+1]
        lea     r15, [r10+2]
        mov     r9, QWORD PTR [rsp+8]
        imul    r11, rsi
        lea     rax, [rdx+1]
        mov     rcx, QWORD PTR [rsp+224]
        imul    rbx, rsi
        lea     r12, [rdx+3]
        imul    r15, rsi
        sal     rsi, 2
        lea     rbp, [r12+r11]
        add     rdx, r11
        add     r11, rax
        lea     r14, [r12+rbx]
        lea     r13, [rdi+rdx*4]
        add     rbx, rax
        add     r12, r15
        add     r15, rax
        mov     QWORD PTR [rsp], r15
        mov     r15, QWORD PTR [rsp+80]
.L13:
        movss   xmm0, DWORD PTR [rdi+rbp*4]
        movss   DWORD PTR [r9], xmm0
        cmp     r15, 1
        je      .L10
        movss   xmm0, DWORD PTR [rdi+r14*4]
        movss   DWORD PTR [r9+4], xmm0
        cmp     r15, 2
        je      .L11
        movss   xmm0, DWORD PTR [rdi+r12*4]
        movss   DWORD PTR [r9+8], xmm0
.L11:
        cmp     r10, rcx
        jg      .L65
.L37:
        mov     rax, r13
        mov     rdx, r10
.L14:
        movss   xmm0, DWORD PTR [rax+4]
        add     rdx, 1
        movss   DWORD PTR [rax+12], xmm0
        add     rax, rsi
        cmp     rcx, rdx
        jge     .L14
        movss   xmm0, DWORD PTR [r9]
        movss   DWORD PTR [rdi+r11*4], xmm0
        cmp     r15, 1
        je      .L15
.L35:
        movss   xmm0, DWORD PTR [r9+4]
        movss   DWORD PTR [rdi+rbx*4], xmm0
        cmp     r15, 3
        jne     .L15
        movss   xmm0, DWORD PTR [r9+8]
        mov     rax, QWORD PTR [rsp]
        movss   DWORD PTR [rdi+rax*4], xmm0

There is a lot of pointer arithmetic, comparison and branching involved. I suspect that the compiler has to check for aliasing of the subarrays or for zero array extents, which makes it harder to perform efficient vectorization.
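If missing layout information is indeed part of the obstacle, one way to hand the compiler more to work with (a hedged sketch, not benchmarked here) is Fortran 2008's contiguous attribute on the dummy array, which guarantees that the actual argument is stored with unit stride:

```fortran
  subroutine exchange_rows_contig(A, row1, row2)
    implicit none
    ! contiguous promises the compiler a unit-stride storage layout,
    ! which can simplify its stride and aliasing analysis
    real, contiguous, dimension(:,:), intent(in out) :: A
    integer,                          intent(in)     :: row1, row2

    real :: save_row(size(A,2))   ! one full row of A

    save_row  = A(row1,:)
    A(row1,:) = A(row2,:)
    A(row2,:) = save_row

  end subroutine exchange_rows_contig
```

Whether this actually recovers the speed of the loop version would have to be measured on the same benchmark.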

Upvotes: 1

Federico Perini

Reputation: 1416

My understanding of Fortran's philosophy (and strength) is that the language should let the user focus on the science, while taking care of most computer-related concerns such as optimization for speed, memory management, etc.

A functional programming style via pure/elemental functions and subroutines is IMHO one of the greatest tools the language has introduced, yet it remains underutilized, because it makes code clearer, simpler and more robust.

So I've added one more test with an elemental swapping routine:

  subroutine exchange_rows_elemental(A, row1, row2)
    implicit none
    real, dimension(:,:), intent(in out) :: A
    integer,              intent(in)     :: row1, row2
    call swap(A(row1,:),A(row2,:))
  end subroutine exchange_rows_elemental  

  elemental subroutine swap(a,b)
     real, intent(inout) :: a,b
     real :: save_val
     save_val = a
     a = b
     b = save_val
  end subroutine swap  

And in the main (with an additional timer, real :: time_elemental, declared alongside the others):

call CPU_time(t1)
do i = 1, loop_max
  call exchange_rows_elemental(B, JCOL, IROW)
end do
call CPU_time(t2)
time_elemental = t2 - t1
! write ( *, * ) 'Elemental =', t2 - t1

This is what I get with gfortran 9.2.0 on Windows:

[Image: CPU time comparison adding the elemental version proposed in this answer]

The elemental version is almost as fast as the fastest loop versions, yet it likely operates in a vectorized way. The compiler is probably inlining the swap routine in this case (it might not be able to if swap were in another file), but still, telling the compiler that swap can be applied element-wise probably helps it achieve optimal performance. I like it as a good way to make the best use of compiler optimization without cluttering the source code with nested loops and loop variables.
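As a side note (a minimal standalone sketch, not part of the benchmark above): because swap is elemental, the very same routine applies to scalars or element-wise to any pair of conformable arrays, so scalar, row, column, and whole-matrix swaps all go through one routine:

```fortran
program demo_swap
  implicit none
  real :: x, y
  real :: u(3), v(3)

  x = 1.0; y = 2.0
  u = [1.0, 2.0, 3.0]
  v = [4.0, 5.0, 6.0]

  call swap(x, y)   ! scalar swap
  call swap(u, v)   ! element-wise swap of two whole arrays

  print *, x, y     ! 2.0 1.0
  print *, u        ! 4.0 5.0 6.0

contains

  elemental subroutine swap(a, b)
    real, intent(in out) :: a, b
    real :: save_val
    save_val = a
    a = b
    b = save_val
  end subroutine swap

end program demo_swap
```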

Upvotes: 2
