Stefano Borini

Reputation: 143755

What causes the runtime difference in this trivial Fortran code?

I observed a very curious effect in this trivial program:

module Moo 
contains
   subroutine main()
      integer :: res 
      real :: start, finish
      integer :: i

      call cpu_time(start)

      do i = 1, 1000000000
         call Squared(5, res) 
      enddo
      call cpu_time(finish)

      print '("Time = ",f6.3," seconds.")',finish-start
   end subroutine

   subroutine Squared(v, res)
      integer, intent(in) :: v
      integer, intent(out) :: res 

      res = v*v 
   end subroutine 

!   subroutine main2()
!      integer :: res
!      real :: start, finish
!      integer :: i
!
!      call cpu_time(start)
!      
!      do i = 1, 1000000000
!         res = v*v
!      enddo
!      call cpu_time(finish)
!
!      print '("Time = ",f6.3," seconds.")',finish-start
!   end subroutine

end module
program foo 
   use Moo 
   call main()
!   call main2()
end program

The compiler is gfortran 4.6.2 on a Mac. If I compile with -O0 and run the program, the timing is 4.36 seconds. If I uncomment the subroutine main2(), but not its call, the timing becomes 4.15 seconds on average. If I also uncomment the call to main2(), the first timing becomes 3.80 seconds and the second 1.86 seconds (understandable, since that loop has no function call).

I compared the assembly produced in the second and third cases (routine uncommented; call commented vs. uncommented) and it is exactly the same, save for the actual invocation of the main2 routine.

How can the code get this performance increase from a call to a routine that will only happen later in the run, when there is basically no difference in the resulting code?

Upvotes: 2

Views: 373

Answers (2)

milancurcic

Reputation: 6241

The first thing I noticed was that your program runs for way too short a time for proper benchmarking. How many runs do you average over? What is the standard deviation? I added a nested do loop to your code to make it run longer (a sketch of a harness that reports mean and standard deviation follows the snippet):

! note: j also needs to be declared as an integer in main
do i = 1, 1000000000
  do j = 1, 10
    call Squared(5, res)
  enddo
enddo
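
For what it's worth, a mean and standard deviation can be reported from a single run with something along these lines. This is only a minimal sketch that reuses the Moo module from the question; the run count nruns is an arbitrary choice.

program bench
   use Moo
   implicit none
   integer, parameter :: nruns = 5          ! arbitrary number of repetitions
   integer :: r, i, res
   real :: start, finish, t(nruns), mean, sd

   do r = 1, nruns
      call cpu_time(start)
      do i = 1, 1000000000
         call Squared(5, res)
      enddo
      call cpu_time(finish)
      t(r) = finish - start                 ! per-run loop time
   enddo

   mean = sum(t) / nruns
   sd   = sqrt(sum((t - mean)**2) / (nruns - 1))   ! sample standard deviation
   print '("mean = ",f7.3," s, stddev = ",f7.3," s")', mean, sd
end program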

I looked only at case 1 and case 2 (main2 commented and uncommented), because case 3 is different and irrelevant to this comparison. I would expect a slight increase in runtime in case 2 because a larger executable has to be loaded into memory, even though that part is never executed.

So I did timing (3 runs each) for cases 1 and 2, for three compilers:

pgf90 10.6-0 64-bit target on x86-64 Linux -tp istanbul-64

Intel(R) Fortran Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 12.0.2.137 Build 20110112

GNU Fortran (GCC) 4.1.2 20080704 (Red Hat 4.1.2-51)

on AMD Opteron(tm) Processor 6134

The output of my script is:

exp 1 with pgf90:
Time = 30.619 seconds.
Time = 30.620 seconds.
Time = 30.686 seconds.
exp 2 with pgf90:
Time = 30.606 seconds.
Time = 30.693 seconds.
Time = 30.635 seconds.
exp 1 with ifort:
Time = 77.412 seconds.
Time = 77.381 seconds.
Time = 77.395 seconds.
exp 2 with ifort:
Time = 77.834 seconds.
Time = 77.853 seconds.
Time = 77.825 seconds.
exp 1 with gfortran:
Time = 68.713 seconds.
Time = 68.659 seconds.
Time = 68.650 seconds.
exp 2 with gfortran:
Time = 71.923 seconds.
Time = 74.857 seconds.
Time = 72.126 seconds.

Notice that the time difference between case 1 and case 2 is largest for gfortran and smallest for pgf90.

EDIT: As Stefano Borini pointed out, I overlooked the fact that only the loop is timed via the call to cpu_time, so executable load time is out of the equation. The answer by AShelly suggests a possible reason for the remaining difference. For longer runtimes the difference between the two cases becomes minimal; still, I observe a significant difference in the case of gfortran (see above).

Upvotes: 6

AShelly

Reputation: 35520

I think @IRO-bot has the right answer, but I would like to point out that code placement can influence timing, even for identical assembly.

I have two embedded applications running on identical processors. Each has the same hand-coded assembly routine to provide the tightest possible busy-loop (for inserting sub-microsecond delays). I was recently surprised to learn that in one app the loop took 50% longer than in the other. Both generated exactly the same assembly.

It turns out that in one executable, the starting address of the loop body allowed it to fall entirely within the processor's sole instruction cache line. On the slower one, the same function started at an address which caused it to span two lines. The extra fetch required dominated the timing of such a tight loop.

So it is possible to find instances where adding unexecuted code affects the timing of other code, due to a change in the instruction caching sequence.
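
One way to check whether that is happening in the Fortran example would be to see where the hot routine actually lands in memory in the two builds. Below is a minimal, self-contained sketch of the idea: it prints the code address of a tiny bind(c) routine so two builds can be compared against a cache-line boundary. The 64-byte line size is an assumption, and the same check could be tried on the question's Squared routine, although strictly c_funloc wants an interoperable procedure.

module placement_check
   use, intrinsic :: iso_c_binding, only: c_int
contains
   ! tiny routine standing in for Squared; bind(c) makes c_funloc strictly valid
   subroutine squared(v, res) bind(c)
      integer(c_int), intent(in)  :: v
      integer(c_int), intent(out) :: res
      res = v*v
   end subroutine
end module

program where_does_it_land
   use, intrinsic :: iso_c_binding, only: c_funloc, c_intptr_t
   use placement_check
   implicit none
   integer(c_intptr_t) :: addr

   ! convert the procedure address to an integer and report its offset
   ! within an assumed 64-byte instruction cache line
   addr = transfer(c_funloc(squared), addr)
   print '("squared starts at address ",i0,", offset ",i0," into a 64-byte line")', &
      addr, mod(addr, 64_c_intptr_t)
end program

Running the same check on two otherwise identical builds (one with the unused code, one without) would show whether the routine moved relative to a line boundary.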

Upvotes: 5
