Reputation: 5434
I'm running mpirun (Open MPI) with 86 processes on 12 CPUs and 2 GPUs on Ubuntu 18.04. The application trains neural networks.
After a day or so of training, the iterations slow down dramatically. The code works fine on a single thread, network traffic (file reads) is well within spec, and the CPUs and GPUs show no excessive load.
So I think the problem is with mpirun.
Are there non-intrusive tools available to show the performance of the MPI runs? I've been looking at Performance Co-Pilot but I don't see any MPI profiling in the software itself.
Upvotes: 1
Views: 264
Reputation: 1123
Callgrind (Valgrind's call-graph profiler) and KCachegrind (a viewer for its output) might be useful. The Open MPI FAQ section on parallel debuggers [1] may help you as well.
[1] https://www.open-mpi.org/faq/?category=debugging#parallel-debuggers
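If running everything under Callgrind turns out to be too slow for a multi-day job, a much lighter option is to time the compute and communication phases on each rank and watch which one grows. The following is only a rough sketch, not a profiler: `SectionTimer` and the section names are hypothetical, the `time.sleep` calls stand in for your real training step and MPI calls, and it assumes the processes are launched by Open MPI's mpirun, which sets `OMPI_COMM_WORLD_RANK` for each process.

```python
import os
import time
from collections import defaultdict
from contextlib import contextmanager


class SectionTimer:
    """Accumulates wall-clock time per named section (hypothetical helper)."""

    def __init__(self):
        self.totals = defaultdict(float)  # seconds spent per section
        self.counts = defaultdict(int)    # number of times each section ran

    @contextmanager
    def section(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start
            self.counts[name] += 1

    def report(self):
        # Open MPI's mpirun sets OMPI_COMM_WORLD_RANK; fall back to "0"
        # when the script is run outside mpirun.
        rank = os.environ.get("OMPI_COMM_WORLD_RANK", "0")
        parts = [
            f"{name}={self.totals[name]:.3f}s/{self.counts[name]}"
            for name in sorted(self.totals)
        ]
        return f"rank {rank}: " + " ".join(parts)


if __name__ == "__main__":
    timer = SectionTimer()
    for _ in range(3):
        with timer.section("compute"):
            time.sleep(0.01)   # stand-in for the training step
        with timer.section("comm"):
            time.sleep(0.005)  # stand-in for an MPI call such as an allreduce
    print(timer.report())
```

Printing the report every N iterations from each rank should show whether the slowdown is in the communication sections (pointing at MPI or at one straggling rank) or in the compute itself.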
Upvotes: 1