Reputation: 151
I have some Fortran code that performs a matrix row exchange. In the legacy code, it is written as
do J = 1,N
save_val = B(IROW,J)
B(IROW,J) = B(JCOL,J)
B(JCOL,J) = save_val
end do
This exchanges row IROW with row JCOL (IROW and JCOL are integers). However, the function of that block of code is not intuitive. In my opinion, it would be more intuitive, or would at least aid readability, to write it as:
save_row = B(IROW,:)
B(IROW,:) = B(JCOL,:)
B(JCOL,:) = save_row
(it is clearer that entire rows are being moved).
From the attached timing plot, it is clear that the looping methods perform better than the array operations. Why is this? Is it because this becomes a memory-limited process when the number of elements in the array gets large (i.e. the array would get "chunked")? Or is it something else?
Compiling with gfortran -O3 test.f95. Adding the flag -fstack-arrays did not make a significant difference.
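As an aside I have not tried yet, gfortran's -Warray-temporaries option should warn at compile time whenever the compiler creates a hidden temporary for an array expression, which might be one way to test the "chunking" idea:
gfortran -O3 -Warray-temporaries test.f95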
program test
implicit none
integer :: N
integer :: M
integer :: loop_max = 10**7 ! number of repetitions per timing
integer :: i ! loop index
real :: t1, t2
real :: time_loop, time_array, time_sub_loop, time_sub_array
real, dimension(:, :), allocatable :: B
real, dimension(:) , allocatable :: save_row
real :: save_val
integer :: IROW, J, JCOL
character(*), parameter :: format_header = '(A5, 1X, 4(A12,1X))'
character(*), parameter :: format_data = '(I5, 1X, 4(ES12.5, 1X))'
open(1, file = 'TimingRowExchange.txt', status = 'unknown')
write(1, format_header) 'N', 't_loop', 't_array', 't_sub_loop', 't_sub_array'
do N = 1, 100
M = N + 1
allocate(B(N,N), save_row(M))
call random_number(B)
JCOL = 1
IROW = 3
call CPU_time(t1)
do i = 1, loop_max
do J = 1,N
save_val = B(IROW,J)
B(IROW,J) = B(JCOL,J)
B(JCOL,J) = save_val
end do
end do
call CPU_time(t2)
time_loop = t2 - t1
! write ( *, * ) 'Using Loop =', t2 - t1
call CPU_time(t1)
do i = 1, loop_max
save_row(1:N) = B(IROW,:)
B(IROW,:) = B(JCOL,:)
B(JCOL,:) = save_row(1:N)
end do
call CPU_time(t2)
time_array = t2 - t1
! write ( *, * ) 'Using Array =', t2 - t1
call CPU_time(t1)
do i = 1, loop_max
call exchange_rows_loop(B, JCOL, IROW)
end do
call CPU_time(t2)
time_sub_loop = t2 - t1
! write ( *, * ) 'Loop Subrout =', t2 - t1
call CPU_time(t1)
do i = 1, loop_max
call exchange_rows_array(B, JCOL, IROW)
end do
call CPU_time(t2)
time_sub_array = t2 - t1
! write ( *, * ) 'Array Subrout =', t2 - t1
deallocate(B, save_row)
write(1, format_data) N, time_loop, time_array, time_sub_loop, time_sub_array
end do
contains
subroutine print_mat(A)
implicit none
real, dimension(:,:), intent(in) :: A
integer :: n
n = size(A,1) ! # of rows
do i = 1,n
print*, A(i,:)
end do
print *
end subroutine print_mat
subroutine exchange_rows_loop(A, row1, row2)
implicit none
real, dimension(:,:), intent(in out) :: A
integer, intent(in) :: row1, row2
integer :: J
real :: save_val
do J = 1, size(A,2) ! loop over columns; size(A,1) is only equivalent for a square matrix
save_val = A(row1,J)
A(row1,J) = A(row2,J)
A(row2,J) = save_val
end do
end subroutine exchange_rows_loop
subroutine exchange_rows_array(A, row1, row2)
implicit none
real, dimension(:,:), intent(in out) :: A
integer, intent(in) :: row1, row2
real, dimension(size(A,2)) :: save_row ! a full row has size(A,2) elements
save_row = A(row1,:)
A(row1,:) = A(row2,:)
A(row2,:) = save_row
end subroutine exchange_rows_array
end program test
Upvotes: 3
Views: 489
Reputation: 60078
Those two versions are not compiled into the same thing: the ordering of the operations is different and the optimizations done by the compiler are different. The assembly is available at https://godbolt.org/z/hc8zcs
The loop:
movss xmm0, DWORD PTR [rax+12]
movss xmm1, DWORD PTR [rax+4]
movss DWORD PTR [rax+4], xmm0
movss DWORD PTR [rax+12], xmm1
cmp ebx, 1
je .L6
movss xmm0, DWORD PTR [rsi]
movss xmm1, DWORD PTR [rcx]
movss DWORD PTR [rsi], xmm1
movss DWORD PTR [rcx], xmm0
cmp ebx, 2
je .L6
movss xmm0, DWORD PTR [r8]
movss xmm1, DWORD PTR [rdi]
movss DWORD PTR [r8], xmm1
movss DWORD PTR [rdi], xmm0
It is somewhat unrolled and SSE instructions are used. Overall it is very straightforward.
The subarray version:
mov r10, QWORD PTR [rsp+216]
mov rsi, QWORD PTR [rsp+208]
mov r8d, 10000000
mov rdx, QWORD PTR [rsp+152]
mov rdi, QWORD PTR [rsp+144]
mov r11, r10
lea rbx, [r10+1]
lea r15, [r10+2]
mov r9, QWORD PTR [rsp+8]
imul r11, rsi
lea rax, [rdx+1]
mov rcx, QWORD PTR [rsp+224]
imul rbx, rsi
lea r12, [rdx+3]
imul r15, rsi
sal rsi, 2
lea rbp, [r12+r11]
add rdx, r11
add r11, rax
lea r14, [r12+rbx]
lea r13, [rdi+rdx*4]
add rbx, rax
add r12, r15
add r15, rax
mov QWORD PTR [rsp], r15
mov r15, QWORD PTR [rsp+80]
.L13:
movss xmm0, DWORD PTR [rdi+rbp*4]
movss DWORD PTR [r9], xmm0
cmp r15, 1
je .L10
movss xmm0, DWORD PTR [rdi+r14*4]
movss DWORD PTR [r9+4], xmm0
cmp r15, 2
je .L11
movss xmm0, DWORD PTR [rdi+r12*4]
movss DWORD PTR [r9+8], xmm0
.L11:
cmp r10, rcx
jg .L65
.L37:
mov rax, r13
mov rdx, r10
.L14:
movss xmm0, DWORD PTR [rax+4]
add rdx, 1
movss DWORD PTR [rax+12], xmm0
add rax, rsi
cmp rcx, rdx
jge .L14
movss xmm0, DWORD PTR [r9]
movss DWORD PTR [rdi+r11*4], xmm0
cmp r15, 1
je .L15
.L35:
movss xmm0, DWORD PTR [r9+4]
movss DWORD PTR [rdi+rbx*4], xmm0
cmp r15, 3
jne .L15
movss xmm0, DWORD PTR [r9+8]
mov rax, QWORD PTR [rsp]
movss DWORD PTR [rdi+rax*4], xmm0
There is a lot of pointer arithmetic, comparison and branching involved. I suspect that the compiler has to check for aliasing of the subarrays or for zero array extents, and that it is harder to perform efficient vectorization.
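To illustrate where that extra work comes from, here is a rough sketch of my own (an illustration only, not something derived from the assembly above) of what a conservative expansion of B(IROW,:) = B(JCOL,:) could look like when the compiler cannot prove that the two row sections are disjoint and non-empty:
subroutine assign_row_conservative(B, irow, jcol)
  implicit none
  real, dimension(:,:), intent(in out) :: B
  integer, intent(in) :: irow, jcol
  real, dimension(size(B,2)) :: tmp   ! stand-in for a compiler-generated temporary
  integer :: j
  if (size(B,2) == 0) return          ! zero-extent check
  do j = 1, size(B,2)                 ! gather the right-hand side row (strided loads)
    tmp(j) = B(jcol,j)
  end do
  do j = 1, size(B,2)                 ! copy the temporary back into the left-hand side row
    B(irow,j) = tmp(j)
  end do
end subroutine assign_row_conservative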
Upvotes: 1
Reputation: 1416
My understanding of Fortran's philosophy (and strength) is that the language should help the user focus on the science, while taking care of most computer-related concerns such as optimization for speed, memory management, etc.
Functional programming style via pure/elemental functions and subroutines is IMHO one of the greatest tools the language has introduced, because it makes code clearer, simpler and more robust, yet it remains underutilized.
So I've added one more test with an elemental swapping routine:
subroutine exchange_rows_elemental(A, row1, row2)
implicit none
real, dimension(:,:), intent(in out) :: A
integer, intent(in) :: row1, row2
call swap(A(row1,:),A(row2,:))
end subroutine exchange_rows_elemental
elemental subroutine swap(a,b)
real, intent(inout) :: a,b
real :: save_val
save_val = a
a = b
b = save_val
end subroutine swap
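A side note of mine, not part of the timing test: because swap is elemental, the same routine also applies element by element to any conformable arguments, for example two scalars or two whole array sections (the variable names below are just placeholders):
call swap(x, y)            ! x and y are two real scalars
call swap(A(:,1), A(:,2))  ! swaps the first two columns of A element by element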
And in the main program:
call CPU_time(t1)
do i = 1, loop_max
call exchange_rows_elemental(B, JCOL, IROW)
end do
call CPU_time(t2)
time_elemental = t2 - t1
! write ( *, * ) 'Elemental =', t2 - t1
This is what I get with gfortran 9.2.0 on Windows: the elemental version is almost as fast as the fastest loop versions, while it likely operates in a vectorized way. In this case the compiler is probably inlining the swap routine (it might not be able to if swap were in another file), but still, telling the compiler that the swap routine can be vectorized probably helps it achieve optimal performance. I like it as a good way to make the best use of compiler optimization without cluttering the source code with nested loops and loop variables.
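One more remark from me, untested here: if swap did live in a separate file, link-time optimization might still allow cross-file inlining, e.g. something like
gfortran -O3 -flto main.f95 swap.f95
(the file names are only placeholders for this example).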
Upvotes: 2