bestrong

Reputation: 151

Looping twice over 1D arrays with different loop orders

Does anyone know why Program B is faster than Program A?

I used ifort 16 with the -fast optimization flag. The optimization reports estimate a potential speedup of 10.09 for Program A but only 3.90 for Program B. In practice, however, Program B runs in 14 s, while Program A takes 20 s.

!Program A
 DO J=1, 100000          !This is the different part
   !$OMP SIMD
    DO I=1, 100000
       IF(A(I)==J) THEN
          B(I)=J
       END IF          
    END DO
   !$OMP END SIMD
 END DO

!Program B
 DO I=1, 100000          !This is the different part
   !$OMP SIMD
    DO J=1, 100000
       IF(A(I)==J) THEN
          B(I)=J
       END IF          
    END DO
   !$OMP END SIMD
 END DO   

Both programs were successfully vectorized, and my intuition was that Program A would be faster, since I assumed the two loops would be vectorized as follows:

!Program A
 IF(A(I)==J) THEN
    B(I)=J
 END IF

 IF(A(I+1)==J) THEN
    B(I+1)=J
 END IF

... 

and

!Program B
 IF(A(I)==J) THEN
    B(I)=J
 END IF

 IF(A(I)==J+1) THEN
    B(I)=J+1
 END IF

... 

I expected Program A to be more efficient, since its left-hand-side indexes are computed directly. But my expectation turned out to be wrong. Thanks in advance.

Upvotes: 0

Views: 46

Answers (1)

innoSPG

Reputation: 4656

A program's run time has several components. Computation time is the one we usually look at, but memory access is the bottleneck of most programs. There are other components, but I will limit myself to these two. In your case, memory access may well be what makes the difference.

In Program B, the inner loop (index J) runs 100000 times and accesses the same memory locations, A(I) and B(I), on every iteration for a given outer index I. Once those values are loaded into registers, there is no need to go back to memory. The program goes to memory only when the outer index I changes, which happens 100000 times.
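To make the register argument concrete, here is a minimal sketch of Program B with the accesses to A(I) and B(I) hoisted out of the inner loop by hand, which is roughly what an optimizer is free to do. The program name, the scalars AI and BI, the reduced size N, and the fill pattern for A are assumptions for illustration. A and B are assumed to be default INTEGER arrays, since the question does not show their declarations.

```fortran
PROGRAM HOIST_SKETCH
  IMPLICIT NONE
  INTEGER, PARAMETER :: N = 2000   ! assumed size, much smaller than the 100000 in the question
  INTEGER :: A(N), B(N), BREF(N)
  INTEGER :: I, J, AI, BI

  ! Assumed test data: values in 0..N+9, so a few entries never match any J
  DO I = 1, N
     A(I) = MOD(7*I, N + 10)
  END DO
  B = -1
  BREF = -1

  ! Program B as written in the question (reference result)
  DO I = 1, N
     DO J = 1, N
        IF (A(I) == J) THEN
           BREF(I) = J
        END IF
     END DO
  END DO

  ! Same computation with A(I) and B(I) held in scalars
  DO I = 1, N
     AI = A(I)            ! one load per outer iteration
     BI = B(I)
     DO J = 1, N          ! inner loop touches no array element
        IF (AI == J) BI = J
     END DO
     B(I) = BI            ! one store per outer iteration
  END DO

  IF (ALL(B == BREF)) THEN
     PRINT *, 'results match'
  ELSE
     PRINT *, 'results differ'
  END IF
END PROGRAM HOIST_SKETCH
```

In the hoisted version the inner loop works purely on scalars, so the arrays are read and written only once per value of I.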

In Program A, the inner loop (index I) runs 100000 times and accesses different memory locations, A(I) and B(I), on every iteration. Since 100000 elements cannot all be held in registers, the CPU has to wait for the data. And since the outer loop (index J) also runs 100000 times, Program A performs on the order of 100000 x 100000 memory accesses, compared with about 100000 for Program B.

That is one possible explanation of what you observed.
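If you want to check this on your own machine, a small timing harness along the following lines could help. It is only a sketch: the program name, the reduced size N, the fill pattern for A, and the use of SYSTEM_CLOCK for timing are my assumptions, not something taken from your code, and the arrays are assumed to be default INTEGER.

```fortran
PROGRAM LOOP_ORDER_TIMING
  IMPLICIT NONE
  INTEGER, PARAMETER :: I8 = SELECTED_INT_KIND(18)
  INTEGER, PARAMETER :: N = 20000      ! assumed size; the question uses 100000
  INTEGER, ALLOCATABLE :: A(:), B1(:), B2(:)
  INTEGER(KIND=I8) :: T0, T1, T2, RATE
  INTEGER :: I, J

  ALLOCATE(A(N), B1(N), B2(N))
  DO I = 1, N
     A(I) = MOD(7*I, N) + 1            ! assumed fill: values in 1..N
  END DO
  B1 = 0
  B2 = 0

  CALL SYSTEM_CLOCK(T0, RATE)

  ! Program A order: J outer, I inner (streams through A and B every J)
  DO J = 1, N
     !$OMP SIMD
     DO I = 1, N
        IF (A(I) == J) B1(I) = J
     END DO
  END DO
  CALL SYSTEM_CLOCK(T1)

  ! Program B order: I outer, J inner (A(I) and B(I) fixed in the inner loop)
  DO I = 1, N
     !$OMP SIMD
     DO J = 1, N
        IF (A(I) == J) B2(I) = J
     END DO
  END DO
  CALL SYSTEM_CLOCK(T2)

  PRINT *, 'Program A order (s):', REAL(T1 - T0) / REAL(RATE)
  PRINT *, 'Program B order (s):', REAL(T2 - T1) / REAL(RATE)
  ! Using the results keeps the compiler from discarding the loops
  PRINT *, 'same result:', ALL(B1 == B2)
  PRINT *, 'checksum:', SUM(INT(B1, I8))
END PROGRAM LOOP_ORDER_TIMING
```

The timings will depend on the compiler, the flags, and the cache sizes of the machine, so treat the numbers as an indication only. The !$OMP SIMD lines are plain comments unless OpenMP SIMD support is enabled at compile time.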

Upvotes: 1
