Reputation: 23
I have written an MPI-based C code that I use to perform numerical simulations in parallel. Due to some poor design on my part, I have built some inherent MPI dependencies into the code (array structures, MPI-IO). This means that if I want to run my code in serial, I have to invoke
mpiexec -n 1 c_exe
Main problem: I use my C code within a Python workflow, simplified in the loop below.
import os
import subprocess

homedir = os.getenv('PBS_O_WORKDIR')
nevents = 100

for ievent in range(nevents):
    perform_workflow_management()      # prepare the input for this event
    os.chdir(str(ievent))              # each event runs in its own subdirectory
    subprocess.call('mpiexec -n 1 c_exe', shell=True)
    os.chdir(homedir)
The Python workflow is primarily for management and makes calls to the C code which performs the numerically intensive work.
The tasks within the Python for loop are independent; consequently, I would like to employ an embarrassingly parallel scheme to parallelize the loop over events. Benchmarks indicate that parallelizing the loop over events will be faster than a serial loop with parallel MPI calls. Furthermore, I am running this on a PBS-Torque cluster.
I am at a loss about how to do this effectively. The complication seems to arise from the MPI call to my C code and the assignment of multiple MPI tasks.
Things I have tried in some form:
Wrappers to pbsdsh - incur problems with processor assignment.
MPMD with mpiexec (see the sketch below) - theoretically does what I would like, but fails because all processes seem to share MPI_COMM_WORLD. My C code establishes a Cartesian topology for domain-based parallelism; conflicts arise here.
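For reference, the MPMD launch I attempted looks roughly like the sketch below (two events shown; in reality there would be one -n 1 block per event):

mpiexec -n 1 c_exe : -n 1 c_exe

Every c_exe instance started this way joins the same MPI_COMM_WORLD, which is exactly where the topology conflicts come from.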
Does anyone have suggestions on how I might deploy this in an embarrassingly parallel fashion? Ideally, I would like to submit a job request
qsub -l nodes=N:ppn=1,walltime=XX:XX:XX go_python_job.bash
where N is the number of processors. On each process I would then like to be able to submit independent mpiexec calls to my C code.
I'm aware that part of the issue is down to design flaws, but if I could find a solution without having to refactor large parts of the code, that would be advantageous.
Upvotes: 1
Views: 1052
Reputation: 74395
First of all, with any decent MPI implementation you don't have to use mpiexec to start a singleton MPI job - simply run the executable (MPI standard, §10.5.2 Singleton MPI_INIT). It works at least with Open MPI and the MPICH family.
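For a single rank, the two invocations below are therefore equivalent (a sketch, assuming c_exe was built against Open MPI or MPICH):

mpiexec -n 1 ./c_exe   # explicit launcher
./c_exe                # singleton MPI_INIT; the process initialises MPI on its own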
Second, any decent DRM (distributed resource manager, a.k.a. batch queueing system) supports array jobs. Those are the DRM equivalent of SPMD - multiple jobs with the same job file.
To get an array job with Torque, pass qsub the -t from-to option, either on the command line or in the job script:
#PBS -t 1-100
...
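The command-line equivalent would look something like this (the script name and walltime placeholder are copied from the question; the array range of 100 matches the number of events):

qsub -t 1-100 -l nodes=1:ppn=1,walltime=XX:XX:XX go_python_job.bash

Note that each array instance now requests just a single processor; the scheduler runs as many instances concurrently as free resources allow.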
Then, in your Python script, obtain the value of the PBS_ARRAYID environment variable and use it to differentiate between the different instances:
import os
import subprocess

homedir = os.getenv('PBS_O_WORKDIR')
ievent = os.getenv('PBS_ARRAYID')    # '1' .. '100', one value per array instance

perform_workflow_management()
os.chdir(ievent)                     # assumes event directories named '1' .. '100'
subprocess.call('./c_exe', shell=True)   # singleton run - no mpiexec needed
os.chdir(homedir)
As already mentioned by @Zulan, this allows the job scheduler to better exploit the resources of the cluster via backfilling (if your Torque instance is paired with Maui or a similar advanced scheduler).
The advantage of array jobs is that, although from your perspective they look and work (mostly) like a single job, the scheduler still sees them as separate jobs and schedules them individually.
One possible disadvantage of this approach is that if jobs are scheduled exclusively, i.e. no two jobs can share a compute node, then the utilisation will be quite low unless your cluster nodes have just one single-core CPU each (very unlikely nowadays).
Upvotes: 2