Reputation: 1889
I have some Fortran 90 code that I've been using for finite element computations. Lately, I've been trying to improve how it solves block linear systems. Before, I had a subroutine amux for sparse matrix-vector multiplication and another subroutine cg which implements the conjugate gradient method using amux. I wrote a new matrix-vector subroutine block_amux and likewise a new solver block_cg. By all rights, the new method should run faster, but instead it runs 10 times slower.
In order to track down the problem, I used the profiler gprof to see what was going on. It reported that 92.5% of my program's run time was spent in the cg subroutine, even though I never called it and relied exclusively on block_amux and block_cg. To muddy the waters even further, I put a print statement in the actual cg routine saying "Hello world"; it was never printed. Finally, I noticed that gprof lists no calls to the amux subroutine, even though a genuine call to cg would have done hundreds of ordinary matrix multiplications.
I'm mystified as to what could be doing this. Any thoughts? I can attach the gprof output if that helps too.
Update: I have made the following changes, with the same result some way or other:
cg becomes conjugate_gradient. Gprof then reports that I'm wasting time in the new conjugate_gradient routine.
[...] linalg_mod in which they originally resided, then stop using the module containing the CG routine. Instead, the program wastes time in something called a "frame_dummy". This looks suspiciously similar to this post, but I can't [...]
[...] linalg_mod, which contains the CG routine, to a new module linalg_mod_decoy, which does not contain it. Instead of wasting time in the CG algorithm, gprof says that the program is calling a subroutine I use to generate the right-hand side of the linear system ~3000 times instead of just once.
Upvotes: 2
Views: 387
Reputation: 34398
Quoting a comment by korrok, the question author:
OpenMP was the culprit. I figured that if I set the number of threads to 1 I would get the same result as profiling without OMP at all. When I stopped compiling with OpenMP it still performed poorly but correctly reported where all the work was being done.
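Since the profiler's attribution could not be trusted while OpenMP was enabled, one way to cross-check where the time goes is to wrap the suspect call in a manual wall-clock timer. The following is only a minimal sketch: stand_in_solver is a hypothetical placeholder for the real block_cg call (whose interface is not shown in the question), and the loop inside it is dummy work so the example runs on its own.

```fortran
program time_solver
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer :: t_start, t_end, clock_rate
  real(dp) :: elapsed, checksum

  ! system_clock reports wall-clock ticks, independent of gprof's sampling
  call system_clock(count_rate=clock_rate)

  call system_clock(count=t_start)
  ! Hypothetical placeholder: replace with the real call to block_cg
  call stand_in_solver(2000000, checksum)
  call system_clock(count=t_end)

  elapsed = real(t_end - t_start, dp) / real(clock_rate, dp)
  print '(a, f10.6, a)', 'stand_in_solver took ', elapsed, ' s'
  ! Printing the result keeps the compiler from optimising the work away
  print '(a, es12.4)', 'checksum = ', checksum

contains

  ! Dummy workload standing in for the solver (assumption, not the author's code)
  subroutine stand_in_solver(n, s)
    integer, intent(in) :: n
    real(dp), intent(out) :: s
    integer :: i
    s = 0.0_dp
    do i = 1, n
      s = s + 1.0_dp / real(i, dp)**2
    end do
  end subroutine stand_in_solver

end program time_solver
```

Timing block_cg and block_amux separately in this way gives numbers that can be compared against the gprof output from a build without OpenMP.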
Upvotes: 1