Reputation: 2639
I got an MPI program written by other people.
Its basic structure is like this:
program basis
    initialize MPI
    do n=1,12
        call mpi_job(n)
    end do
    finalize MPI
contains
    subroutine mpi_job(n) !this is mpi subroutine
    .....
    end subroutine
end program
What I want to do now is to make the do loop a parallel do loop. So if I have a 24-core machine, I can run this program with 12 mpi_job instances running simultaneously, each using 2 threads. There are several reasons to do this; for example, the performance of mpi_job may not scale well with the number of cores. To sum up, I want to turn one level of MPI parallelization into two levels of parallelization.
I constantly encounter this problem when working with other people. The question is: what is the easiest and most efficient way to modify the program?
Upvotes: 0
Views: 464
Reputation: 1764
There is some confusion here over the basics of what you are trying to do. Your skeleton code will not run 12 MPI jobs at the same time; each MPI process that you create will run 12 jobs sequentially.
What you want to do is run 12 MPI processes, each of which calls mpi_job a single time. Within mpi_job, you can then create 2 threads using OpenMP.
Process and thread placement is outside the scope of the MPI and OpenMP standards. For example, ensuring that the processes are spread evenly across your multicore machine (e.g. one process on each of the even-numbered cores 0, 2, ..., 22 of a 24-core node) and that each process's two OpenMP threads run on an adjacent pair of cores requires you to look up the man pages for your particular MPI and OpenMP implementations. You may be able to place processes using arguments to mpiexec; thread placement may be controlled by environment variables, e.g. KMP_AFFINITY for Intel OpenMP.
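For illustration only, with Open MPI and the standard OpenMP placement variables it might look something like this (a hedged sketch, not a recipe: the PE=2 mapping syntax and the binding behaviour should be checked against your own implementation's documentation):
user@laptop$ export OMP_NUM_THREADS=2
user@laptop$ export OMP_PLACES=cores      # standard OpenMP placement variable
user@laptop$ export OMP_PROC_BIND=close   # keep each rank's threads on adjacent cores
user@laptop$ mpiexec -n 12 --map-by slot:PE=2 --bind-to core ./basis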
Placement aside, here is a code that I think does what you want (I make no comment on whether it is the most efficient thing to do). I am using GNU compilers here.
user@laptop$ mpif90 -fopenmp -o basis basis.f90
user@laptop$ export OMP_NUM_THREADS=2
user@laptop$ mpiexec -n 12 ./basis
Running 12 MPI jobs at the same time
MPI job 2 , thread no. 1 reporting for duty
MPI job 11 , thread no. 1 reporting for duty
MPI job 11 , thread no. 0 reporting for duty
MPI job 8 , thread no. 0 reporting for duty
MPI job 0 , thread no. 1 reporting for duty
MPI job 0 , thread no. 0 reporting for duty
MPI job 2 , thread no. 0 reporting for duty
MPI job 8 , thread no. 1 reporting for duty
MPI job 4 , thread no. 1 reporting for duty
MPI job 4 , thread no. 0 reporting for duty
MPI job 10 , thread no. 1 reporting for duty
MPI job 10 , thread no. 0 reporting for duty
MPI job 3 , thread no. 1 reporting for duty
MPI job 3 , thread no. 0 reporting for duty
MPI job 1 , thread no. 0 reporting for duty
MPI job 1 , thread no. 1 reporting for duty
MPI job 5 , thread no. 0 reporting for duty
MPI job 5 , thread no. 1 reporting for duty
MPI job 9 , thread no. 1 reporting for duty
MPI job 9 , thread no. 0 reporting for duty
MPI job 7 , thread no. 0 reporting for duty
MPI job 7 , thread no. 1 reporting for duty
MPI job 6 , thread no. 1 reporting for duty
MPI job 6 , thread no. 0 reporting for duty
Here's the code:
program basis
    use mpi
    implicit none
    integer :: ierr, size, rank
    integer :: comm = MPI_COMM_WORLD

    call MPI_Init(ierr)
    call MPI_Comm_size(comm, size, ierr)
    call MPI_Comm_rank(comm, rank, ierr)

    ! One job per MPI process: rank n runs mpi_job(n), so
    ! "mpiexec -n 12" gives 12 jobs running simultaneously
    if (rank == 0) then
        write(*,*) 'Running ', size, ' MPI jobs at the same time'
    end if
    call mpi_job(rank)
    call MPI_Finalize(ierr)

contains

    subroutine mpi_job(n) !this is mpi subroutine
        use omp_lib
        implicit none
        integer, intent(in) :: n
        integer :: ithread

        ! Each process spawns OMP_NUM_THREADS threads (2 in this example)
        !$omp parallel default(none) private(ithread) shared(n)
        ithread = omp_get_thread_num()
        write(*,*) 'MPI job ', n, ', thread no. ', ithread, ' reporting for duty'
        !$omp end parallel
    end subroutine mpi_job

end program basis
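A caveat worth noting: the code above makes no MPI calls inside the parallel region, so plain MPI_Init suffices. If your real mpi_job ever calls MPI from the master thread of a parallel region, the MPI standard says you should request a threading level at startup. A minimal sketch of that initialization (replacing the MPI_Init call above):
integer :: provided
call MPI_Init_thread(MPI_THREAD_FUNNELED, provided, ierr)
! MPI_THREAD_FUNNELED: only the thread that called MPI_Init_thread
! (the master thread) may make MPI calls
if (provided < MPI_THREAD_FUNNELED) then
    write(*,*) 'MPI library does not provide the requested threading level'
end if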
Upvotes: 1
Reputation: 22670
You should use sub-communicators. Compute job_nr = floor(global_rank / ranks_per_job) and call MPI_COMM_SPLIT with job_nr as the colour. This creates a local communicator to be used within each mpi_job. All communication should then use that communicator and the rank local to it. Of course, this all implies that there are no dependencies between the different calls to mpi_job - or that you map those onto the appropriate global/world communicator.
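Here is a minimal sketch of that splitting, assuming 2 ranks per job (the ranks_per_job value and the program name are illustrative choices, not from the original code):
program split_demo
    use mpi
    implicit none
    integer, parameter :: ranks_per_job = 2   ! assumed width of each job
    integer :: ierr, world_rank, job_nr, job_comm, job_rank

    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, world_rank, ierr)

    ! Ranks with the same colour (job_nr) land in the same sub-communicator
    job_nr = world_rank / ranks_per_job
    call MPI_Comm_split(MPI_COMM_WORLD, job_nr, world_rank, job_comm, ierr)

    ! Inside mpi_job, all communication would use job_comm and the local rank
    call MPI_Comm_rank(job_comm, job_rank, ierr)
    write(*,*) 'world rank ', world_rank, ' -> job ', job_nr, ', local rank ', job_rank

    call MPI_Comm_free(job_comm, ierr)
    call MPI_Finalize(ierr)
end program split_demo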
Upvotes: 1
Reputation: 1527
So if I got a 24 core machine, I can run this program with 12 mpi_job running simultaneously and each mpi_job uses 2 threads.
I wouldn't do that. I recommend mapping MPI processes to NUMA nodes and then spawning k threads where there are k cores per NUMA node.
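For example, with Open MPI the mapping might look like this (a hedged sketch: I am assuming two NUMA nodes with 12 cores each, and the flags should be verified for your installation):
user@laptop$ export OMP_NUM_THREADS=12   # k = cores per NUMA node, assumed 12 here
user@laptop$ mpiexec -n 2 --map-by numa --bind-to numa ./basis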
There are several reasons to do this, for example, the performance of mpi_job may not scale well with number of cores.
That's an entirely different issue. What aspect of mpi_job won't scale well? Is it memory bound? Does it require too much communication?
Upvotes: 1