Reputation: 1209
I am using a relatively simple code parallelize with OpenMP to familiarize myself with gprof.
My code mainly consists of gathering data from input files, perform some array manipulations and write the new data to different output files. I placed some calls to the intrinsic subroutine CPU_TIME
to see if gprof was being accurate:
PROGRAM main
USE global_variables
USE fileio, ONLY: read_old_restart, write_new_restart, output_slice, write_solution
USE change_vars
IMPLICIT NONE
REAL(dp) :: t0, t1
!~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
CALL CPU_TIME(t0)
CALL allocate_data
CALL CPU_TIME(t1)
PRINT*, "Allocate data =", t1 - t0
!~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
CALL CPU_TIME(t0)
CALL build_grid
CALL CPU_TIME(t1)
PRINT*, "Build grid =", t1 - t0
!~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
CALL CPU_TIME(t0)
CALL read_old_restart
CALL CPU_TIME(t1)
PRINT*, "Read restart =", t1 - t0
!~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
CALL CPU_TIME(t0)
CALL regroup_all
CALL CPU_TIME(t1)
PRINT*, "Regroup all =", t1 - t0
!~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
CALL CPU_TIME(t0)
CALL redistribute_all
CALL CPU_TIME(t1)
PRINT*, "Redistribute =", t1 - t0
!~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
CALL CPU_TIME(t0)
CALL write_new_restart
CALL CPU_TIME(t1)
PRINT*, "Write restart =", t1 - t0
END PROGRAM main
Here is the output:
Allocate data = 1.000000000000000E-003
Build grid = 0.000000000000000E+000
Read restart = 10.7963590000000
Regroup all = 6.65998700000000
Redistribute = 14.3518180000000
Write restart = 53.5218640000000
Therefore, the write_new_restart
subroutine is the most time consuming and takes about 62% of the total run time. However according to grof, the subroutine redistribute_vars
, which is called multiple times by redistribute_all
is the most time consuming with 70% of the total time. Here is the output from gprof:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
74.40 8.95 8.95 61 0.15 0.15 change_vars_mp_redistribute_vars_
19.12 11.25 2.30 60 0.04 0.04 change_vars_mp_regroup_vars_
6.23 12.00 0.75 63 0.01 0.01 change_vars_mp_fill_last_blocks_
0.08 12.01 0.01 1 0.01 2.31 change_vars_mp_regroup_all_
0.08 12.02 0.01 __intel_ssse3_rep_memcpy
0.08 12.03 0.01 for_open
0.00 12.03 0.00 1 0.00 12.01 MAIN__
0.00 12.03 0.00 1 0.00 0.00 change_vars_mp_build_grid_
0.00 12.03 0.00 1 0.00 9.70 change_vars_mp_redistribute_all_
0.00 12.03 0.00 1 0.00 0.00 fileio_mp_read_old_restart_
0.00 12.03 0.00 1 0.00 0.00 fileio_mp_write_new_restart_
0.00 12.03 0.00 1 0.00 0.00 global_variables_mp_allocate_data_
index % time self children called name
0.00 12.01 1/1 main [2]
[1] 99.8 0.00 12.01 1 MAIN__ [1]
0.00 9.70 1/1 change_vars_mp_redistribute_all_ [3]
0.01 2.30 1/1 change_vars_mp_regroup_all_ [5]
0.00 0.00 1/1 global_variables_mp_allocate_data_ [13]
0.00 0.00 1/1 change_vars_mp_build_grid_ [10]
0.00 0.00 1/1 fileio_mp_read_old_restart_ [11]
0.00 0.00 1/1 fileio_mp_write_new_restart_ [12]
-----------------------------------------------
<spontaneous>
[2] 99.8 0.00 12.01 main [2]
0.00 12.01 1/1 MAIN__ [1]
-----------------------------------------------
0.00 9.70 1/1 MAIN__ [1]
[3] 80.6 0.00 9.70 1 change_vars_mp_redistribute_all_ [3]
8.95 0.00 61/61 change_vars_mp_redistribute_vars_ [4]
0.75 0.00 63/63 change_vars_mp_fill_last_blocks_ [7]
-----------------------------------------------
8.95 0.00 61/61 change_vars_mp_redistribute_all_ [3]
[4] 74.4 8.95 0.00 61 change_vars_mp_redistribute_vars_ [4]
-----------------------------------------------
0.01 2.30 1/1 MAIN__ [1]
[5] 19.2 0.01 2.30 1 change_vars_mp_regroup_all_ [5]
2.30 0.00 60/60 change_vars_mp_regroup_vars_ [6]
-----------------------------------------------
2.30 0.00 60/60 change_vars_mp_regroup_all_ [5]
[6] 19.1 2.30 0.00 60 change_vars_mp_regroup_vars_ [6]
-----------------------------------------------
0.75 0.00 63/63 change_vars_mp_redistribute_all_ [3]
[7] 6.2 0.75 0.00 63 change_vars_mp_fill_last_blocks_ [7]
-----------------------------------------------
<spontaneous>
[8] 0.1 0.01 0.00 for_open [8]
-----------------------------------------------
<spontaneous>
[9] 0.1 0.01 0.00 __intel_ssse3_rep_memcpy [9]
-----------------------------------------------
0.00 0.00 1/1 MAIN__ [1]
[10] 0.0 0.00 0.00 1 change_vars_mp_build_grid_ [10]
-----------------------------------------------
0.00 0.00 1/1 MAIN__ [1]
[11] 0.0 0.00 0.00 1 fileio_mp_read_old_restart_ [11]
-----------------------------------------------
0.00 0.00 1/1 MAIN__ [1]
[12] 0.0 0.00 0.00 1 fileio_mp_write_new_restart_ [12]
-----------------------------------------------
0.00 0.00 1/1 MAIN__ [1]
[13] 0.0 0.00 0.00 1 global_variables_mp_allocate_data_ [13]
-----------------------------------------------
For your information, regroup_all
calls regroup_vars
multiple times and redistribute_all
calls redistribute_vars
and fill_last_blocks
multiple times.
I am compiling my code with ifort
with the -openmp -O2 -pg
options.
QUESTION:
Why is gprof not seeing the time my file i/o subroutines take? (read_old_restart
, write_new_restart
)
Upvotes: 1
Views: 520
Reputation: 40649
gprof
specifically does not include I/O time. It only tries to measure CPU time.
That's because it only does two things: 1) sample the program counter on a 1/100 second clock, and the program counter is meaningless during I/O, and 2) count the number of times any function B is called by any function A.
From the call-counts, it tries to guess how much of each function's CPU time can be attributed to each caller. That's it's whole advance over pre-existing profilers.
When you use gprof
, you should understand what it does and what it doesn't do.
Upvotes: 1