Does anyone know why Program B is faster than Program A? I used ifort-16 with the -fast optimization flag, and the optimization reports say that Program A has an estimated potential speedup of 10.09, while Program B has only 3.90. In practice, however, Program B runs in 14 s, while Program A takes 20 s.
!Program A
DO J=1, 100000 !This is the different part
  !$OMP SIMD
  DO I=1, 100000
    IF(A(I)==J) THEN
      B(I)=J
    END IF
  END DO
  !$OMP END SIMD
END DO
!Program B
DO I=1, 100000 !This is the different part
  !$OMP SIMD
  DO J=1, 100000
    IF(A(I)==J) THEN
      B(I)=J
    END IF
  END DO
  !$OMP END SIMD
END DO
Both programs were successfully vectorized, and my intuition was that Program A would be faster, since (in my opinion) the two codes would be vectorized as follows:
!Program A
IF(A(I)==J) THEN
  B(I)=J
END IF
IF(A(I+1)==J) THEN
  B(I+1)=J
END IF
...
and
!Program B
IF(A(I)==J) THEN
  B(I)=J
END IF
IF(A(I)==J+1) THEN
  B(I)=J+1
END IF
...
I expected Program A to be more efficient, since the left-hand-side indices are computed directly. In fact, my expectation was wrong. Thanks in advance.
The run time of a program has many components. The one we usually look at is computation time, but memory access is the bottleneck of most programs. There are other components, but I will limit myself to these two. In your case, memory access is potentially what makes the difference.
In Program B, the inner loop (index J), which runs 100000 times, accesses the same memory locations A(I) and B(I) on every iteration for a given value of the outer loop index I. Once those values are loaded into registers, there is no need to go to memory again. The program goes to memory only when the outer loop index I changes, that is, 100000 times.
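The register reuse described for Program B can be made explicit with a scalar rewrite of its loop nest. This is only a sketch of what a compiler may effectively do, not a claim about the code ifort actually generates; the names N, ai and bi and the test data in A are hypothetical and do not come from the question.

```fortran
! Sketch only: Program B's loop nest with the register reuse written out.
! N, ai, bi and the initialization of A are hypothetical.
PROGRAM register_reuse
  IMPLICIT NONE
  INTEGER, PARAMETER :: N = 1000
  INTEGER :: A(N), B(N)
  INTEGER :: I, J, ai, bi

  DO I = 1, N
    A(I) = N - I + 1        ! assumed test data, all values in 1..N
  END DO
  B = 0

  DO I = 1, N
    ai = A(I)               ! one load of A(I) per outer iteration
    bi = B(I)               ! one load of B(I) per outer iteration
    DO J = 1, N
      IF (ai == J) bi = J   ! the inner loop touches scalars only
    END DO
    B(I) = bi               ! one store of B(I) per outer iteration
  END DO

  ! Every A(I) lies in 1..N, so each B(I) ends up equal to A(I): prints T
  PRINT *, ALL(B == A)
END PROGRAM register_reuse
```

Written this way, the inner loop contains no array references at all, which is why the number of memory accesses scales with the outer loop only.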
In Program A, the inner loop (index I), which runs 100000 times, accesses different memory locations A(I) and B(I) on each iteration. Since the 100000 elements cannot all be held in registers, the CPU has to wait for the data. And since the outer loop (index J) also runs 100000 times, Program A performs on the order of 100000x100000 memory accesses, compared to about 100000 for Program B.
That is one possible explanation of what you observed.
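One way to check this explanation is to time both loop orderings on the same data in a single program. The following is a minimal sketch under stated assumptions: the array size N (reduced from 100000 to keep the run short), the initialization of A, the checksum, and the use of SYSTEM_CLOCK are all illustrative choices that do not come from the question.

```fortran
! Sketch: time both loop orderings on the same data.
! N, the contents of A, and the timing method are assumptions.
PROGRAM loop_order
  IMPLICIT NONE
  INTEGER, PARAMETER :: I8 = SELECTED_INT_KIND(18)
  INTEGER, PARAMETER :: N = 20000    ! smaller than 100000 for a short run
  INTEGER :: A(N), B(N)
  INTEGER :: I, J
  INTEGER(KIND=I8) :: t0, t1, rate
  REAL :: elapsed

  ! Assumed test data: every A(I) lies in 1..N.
  DO I = 1, N
    A(I) = MOD(I * 7, N) + 1
  END DO

  ! Ordering of Program A: the inner loop streams over all of A and B.
  B = 0
  CALL SYSTEM_CLOCK(t0, rate)
  DO J = 1, N
    !$OMP SIMD
    DO I = 1, N
      IF (A(I) == J) THEN
        B(I) = J
      END IF
    END DO
    !$OMP END SIMD
  END DO
  CALL SYSTEM_CLOCK(t1)
  elapsed = REAL(t1 - t0) / REAL(rate)
  ! Printing a checksum keeps the compiler from discarding the loops.
  PRINT *, 'Program A order (s):', elapsed, ' checksum:', SUM(INT(B, I8))

  ! Ordering of Program B: the inner loop reuses A(I) and B(I).
  B = 0
  CALL SYSTEM_CLOCK(t0)
  DO I = 1, N
    !$OMP SIMD
    DO J = 1, N
      IF (A(I) == J) THEN
        B(I) = J
      END IF
    END DO
    !$OMP END SIMD
  END DO
  CALL SYSTEM_CLOCK(t1)
  elapsed = REAL(t1 - t0) / REAL(rate)
  PRINT *, 'Program B order (s):', elapsed, ' checksum:', SUM(INT(B, I8))
END PROGRAM loop_order
```

Both orderings should print the same checksum. The measured times depend on the compiler, the flags, and the cache sizes of the machine, and an aggressive optimizer may simplify either loop nest, so the numbers are only indicative.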