Reputation: 45
First of all, my English is not good. I'm sorry.
As far as I know, Fortran stores arrays in column-major order. My old Fortran code has not been optimized for a long time, so I am trying to change the index order in my Fortran 90 code for better speed.
Almost every array in the code is 3-dimensional, (i, j, k), and almost every DO loop runs over i and j. The sizes of i and j are about 2000~3000, and k is just 2 (it means x, y).
My old code's index order is (i, k, j).
For example:
do j = 1, 1500
   do i = 1, 1024
      AA(i, 1, j) = ...
      AA(i, 2, j) = ...
   end do
end do
There are a lot of loops like this in my code.
So I changed the index order and tried (i, j, k), (k, i, j), and (i, k, j). I think (k, i, j) should be the best choice in Fortran (column major).
But the results do not show that.
All 3 cases [ (i, j, k), (k, i, j), (i, k, j) ] take almost the same time (1961s, 1955s, 1692s).
My program is long, and the number of iterations (32000) is large enough for a comparison.
Below are my compile options.
ifort -O3 -xHost -ipo -qopenmp -fp-model strict -mcmodel=medium
I don't understand the above result. Please help me.
Thanks for reading.
Additionally, below is one part of my program. The array L_X(i, :, j) is my target; : is 1 and 2.
!$OMP Parallel DO private(j,i,ii,Tan,NormT)
do j=1,LinkPlusBndry
   if (Kmax(j)>2) then
      i=1; Tan=L_X(i+1,:,j)-L_X(i,:,j); NormT=sqrt(Tan(1)**2+Tan(2)**2)
      if (NormT < min_dist) then
         L_X(2:Kmax(j)-1,:,j)=L_X(3:Kmax(j),:,j)
         Kmax(j)=Kmax(j)-1
      elseif (NormT > max_dist) then
         do i=Kmax(j)+1,3,-1; L_X(i,:,j)=L_X(i-1,:,j); end do
         L_X(2,:,j)=(L_X(1,:,j)+L_X(3,:,j))/2.0_dp
         Kmax(j)=Kmax(j)+1
      end if
      do i=2,M-1
         if (i > (Kmax(j)-2) ) exit
         Tan=L_X(i+1,:,j)-L_X(i,:,j); NormT=sqrt(Tan(1)**2+Tan(2)**2)
         if (NormT < min_dist) then
            L_X(i,:,j)=(L_X(i,:,j)+L_X(i+1,:,j))/2.0_dp
            L_X(i+1:Kmax(j)-1,:,j)=L_X(i+2:Kmax(j),:,j)
            Kmax(j)=Kmax(j)-1
         elseif (NormT > max_dist) then
            do ii=Kmax(j)+1,i+2,-1; L_X(ii,:,j)= L_X(ii-1,:,j); end do
            L_X(i+1,:,j)=(L_X(i,:,j)+L_X(i+2,:,j))/2.0_dp
            Kmax(j)=Kmax(j)+1
         end if
      end do
      i=Kmax(j)-1
      if (i>1) then
         Tan=L_X(i+1,:,j)-L_X(i,:,j); NormT=sqrt(Tan(1)**2+Tan(2)**2)
         if (NormT < min_dist) then
            L_X(Kmax(j)-1,:,j)=L_X(Kmax(j),:,j)
            Kmax(j)=Kmax(j)-1
         elseif (NormT > max_dist) then
            L_X(Kmax(j)+1,:,j)= L_X(Kmax(j),:,j)
            L_X(Kmax(j),:,j)=(L_X(Kmax(j)-1,:,j)+L_X(Kmax(j)+1,:,j))/2.0_dp
            Kmax(j)=Kmax(j)+1
         end if
      end if
   elseif (Kmax(j)==2) then
      i=1; Tan=L_X(i+1,:,j)-L_X(i,:,j); NormT=sqrt(Tan(1)**2+Tan(2)**2)
      if (NormT > max_dist) then
         do i=Kmax(j)+1,3,-1; L_X(i,:,j)=L_X(i-1,:,j); end do
         L_X(2,:,j)=(L_X(1,:,j)+L_X(3,:,j))/2.0_dp
         Kmax(j)=Kmax(j)+1
      end if
   end if
   do i=Kmax(j)+1,M; L_X(i,:,j)=L_X(Kmax(j),:,j); end do
end do
!$OMP End Parallel DO
Upvotes: 0
Views: 326
Reputation: 1447
I would not worry so much about loop ordering. ifort at -O3 optimizes loops aggressively, so it's possible that reordering your 3-D arrays will have little to no effect.
As for your thinking that (k,i,j) is the best order: in general it would be. But k has only 2 elements and i has 1024. Assuming you are using single precision reals (4 bytes), one 2-D segment of your 3-D array (all i and k for a fixed j) fits in 8 KB of memory, or 16 KB if the data is double precision, as the 2.0_dp literals in your code suggest. It is likely that once the loop starts, this data sits entirely in the CPU cache, so index ordering would be irrelevant. You need much larger data dimensions for the effect you're considering to become visible.
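The size estimate above can be checked with a few lines of Fortran. This is only a sketch: the extents 1024 and 2 are taken from the example loop in the question, the program and variable names are made up for illustration, and storage_size requires a Fortran 2008 compiler.

```fortran
program slab_footprint
  use, intrinsic :: iso_fortran_env, only: real32, real64
  implicit none
  ! Extents assumed from the question's example loop: i = 1..1024, k = 1..2
  integer, parameter :: ni = 1024, nk = 2
  real(real32) :: slab_sp(ni, nk)   ! one fixed-j segment, single precision
  real(real64) :: slab_dp(ni, nk)   ! same segment, double precision

  ! storage_size returns the size of one element in bits
  print '(a,i0,a)', 'single precision: ', size(slab_sp) * storage_size(slab_sp) / 8, ' bytes'
  print '(a,i0,a)', 'double precision: ', size(slab_dp) * storage_size(slab_dp) / 8, ' bytes'
end program slab_footprint
```

This prints 8192 bytes for single precision and 16384 bytes for double precision. Both are small compared with typical per-core cache sizes, though the exact cache size depends on your CPU.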
As for the performance difference you do see, it is most likely down to how well the compiler manages to optimize each variant.
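If you want to test the layout question in isolation from the rest of your program, a small timing harness along these lines may help. It is a sketch under assumptions: the extents come from the example loop in the question, the repeat count nrep and the loop body are invented for illustration, and the timings will vary with machine, compiler, and flags.

```fortran
program layout_timing
  use, intrinsic :: iso_fortran_env, only: dp => real64, int64
  implicit none
  ! ni, nj, nk follow the question's example; nrep is an arbitrary repeat count
  integer, parameter :: ni = 1024, nj = 1500, nk = 2, nrep = 200
  real(dp), allocatable :: a_ikj(:,:,:), a_kij(:,:,:)
  integer(int64) :: t0, t1, rate
  integer :: i, j, rep

  allocate(a_ikj(ni, nk, nj), a_kij(nk, ni, nj))
  a_ikj = 0.0_dp
  a_kij = 0.0_dp

  ! Layout (i, k, j): i is the fastest-varying index
  call system_clock(t0, rate)
  do rep = 1, nrep
     do j = 1, nj
        do i = 1, ni
           a_ikj(i, 1, j) = a_ikj(i, 1, j) + 1.0_dp
           a_ikj(i, 2, j) = a_ikj(i, 2, j) - 1.0_dp
        end do
     end do
  end do
  call system_clock(t1)
  print '(a,f8.3,a)', '(i,k,j): ', real(t1 - t0, dp) / real(rate, dp), ' s'

  ! Layout (k, i, j): the x,y pair of each point is contiguous
  call system_clock(t0)
  do rep = 1, nrep
     do j = 1, nj
        do i = 1, ni
           a_kij(1, i, j) = a_kij(1, i, j) + 1.0_dp
           a_kij(2, i, j) = a_kij(2, i, j) - 1.0_dp
        end do
     end do
  end do
  call system_clock(t1)
  print '(a,f8.3,a)', '(k,i,j): ', real(t1 - t0, dp) / real(rate, dp), ' s'

  ! Print checksums so the compiler cannot discard the loops as dead code
  print *, 'checksums: ', sum(a_ikj), sum(a_kij)
end program layout_timing
```

Build it with the same options as your real code. If the two timings come out close here as well, that supports the view that the layout is not the bottleneck at these sizes.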
Upvotes: 2