Reputation: 171
I have an MPI Fortran application built with MPICH that launches and runs without problems when I use:
mpiexec -n 16 -f $PBS_NODEFILE $PBS_O_WORKDIR/myMODEL.a
In the above example I am asking for 2 nodes, since each node on the cluster has 8 CPUs.
The problem is that my /home is NFS-mounted on the compute nodes through the head node, and I/O to this disk is very slow. Furthermore, my application does a lot of I/O, and from experience, heavy I/O to a disk NFS-mounted from the head node can lock up the head node (this is bad) and make it completely unresponsive.
The cluster system has a disk that is locally mounted for each job on each node (I can reach this directory through the environment variable TMPDIR), so my job needs to run from that disk. Knowing this, my strategy is very simple: change into $TMPDIR, copy the executable there, and run it from that directory.
If I do all of the above while asking the cluster system (PBS/Torque) for just one node, there is no problem:
#!/bin/csh
#PBS -N TESTE
#PBS -o stdout_file.out
#PBS -e stderr_file.err
#PBS -l walltime=00:01:00
#PBS -q debug
#PBS -l mem=512mb
#PBS -l nodes=1:ppn=8
# NCPU = number of MPI ranks, NNODES = number of distinct nodes in the job
set NCPU = `wc -l < $PBS_NODEFILE`
set NNODES = `uniq $PBS_NODEFILE | wc -l`
# run from the per-job local scratch directory
cd $TMPDIR
cp $PBS_O_WORKDIR/myMODEL.a ./myMODEL.a
mpiexec -n $NCPU -f $PBS_NODEFILE ./myMODEL.a
But if I ask for more than one node:
#!/bin/csh
#PBS -N TESTE
#PBS -o stdout_file.out
#PBS -e stderr_file.err
#PBS -l walltime=00:01:00
#PBS -q debug
#PBS -l mem=512mb
#PBS -l nodes=2:ppn=8
set NCPU = `wc -l < $PBS_NODEFILE`
set NNODES = `uniq $PBS_NODEFILE | wc -l`
cd $TMPDIR
cp $PBS_O_WORKDIR/myMODEL.a ./myMODEL.a
mpiexec -n $NCPU -f $PBS_NODEFILE ./myMODEL.a
I got the following error:
[proxy:0:[email protected]] HYDU_create_process (/tmp/mvapich2-1.8.1/src/pm/hydra/utils/launch/launch.c:69): execvp error on file /state/partition1/74127.beach.colorado.edu/myMODEL.a (No such file or directory)
[proxy:0:[email protected]] HYDU_create_process (/tmp/mvapich2-1.8.1/src/pm/hydra/utils/launch/launch.c:69): execvp error on file /state/partition1/74127.beach.colorado.edu/myMODEL.a (No such file or directory)
[proxy:0:[email protected]] HYDU_create_process (/tmp/mvapich2-1.8.1/src/pm/hydra/utils/launch/launch.c:69): execvp error on file /state/partition1/74127.beach.colorado.edu/myMODEL.a (No such file or directory)
[proxy:0:[email protected]] HYDU_create_process (/tmp/mvapich2-1.8.1/src/pm/hydra/utils/launch/launch.c:69): execvp error on file /state/partition1/74127.beach.colorado.edu/myMODEL.a (No such file or directory)
[proxy:0:[email protected]] HYDU_create_process (/tmp/mvapich2-1.8.1/src/pm/hydra/utils/launch/launch.c:69): execvp error on file /state/partition1/74127.beach.colorado.edu/myMODEL.a (No such file or directory)
[proxy:0:[email protected]] HYDU_create_process (/tmp/mvapich2-1.8.1/src/pm/hydra/utils/launch/launch.c:69): execvp error on file /state/partition1/74127.beach.colorado.edu/myMODEL.a (No such file or directory)
[proxy:0:[email protected]] HYDU_create_process (/tmp/mvapich2-1.8.1/src/pm/hydra/utils/launch/launch.c:69): execvp error on file /state/partition1/74127.beach.colorado.edu/myMODEL.a (No such file or directory)
[proxy:0:[email protected]] HYDU_create_process (/tmp/mvapich2-1.8.1/src/pm/hydra/utils/launch/launch.c:69): execvp error on file /state/partition1/74127.beach.colorado.edu/myMODEL.a (No such file or directory)
[proxy:0:[email protected]] HYD_pmcd_pmip_control_cmd_cb (/tmp/mvapich2-1.8.1/src/pm/hydra/pm/pmiserv/pmip_cb.c:955): assert (!closed) failed
[proxy:0:[email protected]] HYDT_dmxu_poll_wait_for_event (/tmp/mvapich2-1.8.1/src/pm/hydra/tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:[email protected]] main (/tmp/mvapich2-1.8.1/src/pm/hydra/pm/pmiserv/pmip.c:226): demux engine error waiting for event
[[email protected]] HYDT_bscu_wait_for_completion (/tmp/mvapich2-1.8.1/src/pm/hydra/tools/bootstrap/utils/bscu_wait.c:70): one of the processes terminated badly; aborting
[[email protected]] HYDT_bsci_wait_for_completion (/tmp/mvapich2-1.8.1/src/pm/hydra/tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[[email protected]] HYD_pmci_wait_for_completion (/tmp/mvapich2-1.8.1/src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:191): launcher returned error waiting for completion
[[email protected]] main (/tmp/mvapich2-1.8.1/src/pm/hydra/ui/mpich/mpiexec.c:405): process manager error waiting for completion
What am I doing wrong?
Upvotes: 0
Views: 905
Reputation: 745
Looks like when MVAPICH starts the processes on the second node, it does not find your executable. Try adding the following before your mpiexec call to copy your executable, and anything else you need, into the node scratch directories. I'm not a csh user, so you may be able to do this better:
foreach n ( `uniq $PBS_NODEFILE` )
    # copy the executable into the per-job scratch directory on each allocated node
    # (braces around n keep csh from treating the ":" as a variable modifier)
    scp $PBS_O_WORKDIR/myMODEL.a ${n}:$TMPDIR
end
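For reference, here is a sketch of how the multi-node script from the question could look with this staging step added. It assumes passwordless scp between the nodes of the job and that $TMPDIR is the same path on every node, which the error output above suggests is the case:
#!/bin/csh
#PBS -N TESTE
#PBS -o stdout_file.out
#PBS -e stderr_file.err
#PBS -l walltime=00:01:00
#PBS -q debug
#PBS -l mem=512mb
#PBS -l nodes=2:ppn=8
set NCPU = `wc -l < $PBS_NODEFILE`
set NNODES = `uniq $PBS_NODEFILE | wc -l`

# stage the executable into the per-job scratch directory on every allocated node
foreach n ( `uniq $PBS_NODEFILE` )
    scp $PBS_O_WORKDIR/myMODEL.a ${n}:$TMPDIR
end

# run from the local scratch directory so I/O stays off the NFS-mounted /home
cd $TMPDIR
mpiexec -n $NCPU -f $PBS_NODEFILE ./myMODEL.a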
Upvotes: 3