Quim

Reputation: 171

MPICH stops running across more than one node

I have an MPI Fortran application using MPICH that launches and runs without problems if I use:

mpiexec -n 16 -f $PBS_NODEFILE   $PBS_O_WORKDIR/myMODEL.a

In the above example I am asking for 2 nodes, since each node on the cluster has 8 CPUs (2 × 8 = 16 processes).

The problem is that my /home is NFS-mounted on the compute nodes through the head node, and I/O to these disks is very slow. Furthermore, my application does a lot of I/O, and from experience, excessive I/O to disks NFS-mounted through the head node can lock up the head node (this is bad) and make it completely unresponsive.

The cluster system has a disk that is locally mounted for each job on each node (I can use the environment variable TMPDIR to reach this directory), so my job needs to run on this disk. Knowing this, my strategy is very simple:

  1. Move the files from /home to $TMPDIR
  2. Start the simulation in $TMPDIR
  3. After the model stops, copy the outputs back to /home (see the sketch after this list)
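
A minimal sketch of step 3, placed at the end of the job script after mpiexec returns; the output file pattern is a placeholder, since the model's actual file names aren't shown here:

 # Step 3 (sketch): copy results back to the NFS-mounted submit
 # directory ("output*" is a hypothetical pattern for the model's files)
 cp $TMPDIR/output* $PBS_O_WORKDIR/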

If I do all the steps above and ask the cluster system (PBS/Torque) for just one node, there is no problem.

 #!/bin/csh

 #PBS -N TESTE
 #PBS -o stdout_file.out
 #PBS -e stderr_file.err
 #PBS -l walltime=00:01:00
 #PBS -q debug
 #PBS -l mem=512mb
 #PBS -l nodes=1:ppn=8

 # NCPU = total MPI ranks in the job; NNODES = distinct nodes allocated
 set NCPU        = `wc -l < $PBS_NODEFILE`
 set NNODES      = `uniq $PBS_NODEFILE | wc -l`

 # Run from the per-job local scratch directory instead of NFS
 cd $TMPDIR
 cp $PBS_O_WORKDIR/myMODEL.a ./myMODEL.a
 mpiexec -n $NCPU -f $PBS_NODEFILE   ./myMODEL.a

But if I ask for more than one node

 #!/bin/csh

 #PBS -N TESTE
 #PBS -o stdout_file.out
 #PBS -e stderr_file.err
 #PBS -l walltime=00:01:00
 #PBS -q debug
 #PBS -l mem=512mb
 #PBS -l nodes=2:ppn=8

 set NCPU        = `wc -l < $PBS_NODEFILE`
 set NNODES      = `uniq $PBS_NODEFILE | wc -l`

 cd $TMPDIR
 cp $PBS_O_WORKDIR/myMODEL.a ./myMODEL.a
 mpiexec -n $NCPU -f $PBS_NODEFILE   ./myMODEL.a

I got the following error:

 [proxy:0:[email protected]] HYDU_create_process (/tmp/mvapich2-1.8.1/src/pm/hydra/utils/launch/launch.c:69): execvp error on file /state/partition1/74127.beach.colorado.edu/myMODEL.a (No such file or directory)
 [proxy:0:[email protected]] HYDU_create_process (/tmp/mvapich2-1.8.1/src/pm/hydra/utils/launch/launch.c:69): execvp error on file /state/partition1/74127.beach.colorado.edu/myMODEL.a (No such file or directory)
 [proxy:0:[email protected]] HYDU_create_process (/tmp/mvapich2-1.8.1/src/pm/hydra/utils/launch/launch.c:69): execvp error on file /state/partition1/74127.beach.colorado.edu/myMODEL.a (No such file or directory)
 [proxy:0:[email protected]] HYDU_create_process (/tmp/mvapich2-1.8.1/src/pm/hydra/utils/launch/launch.c:69): execvp error on file /state/partition1/74127.beach.colorado.edu/myMODEL.a (No such file or directory)
 [proxy:0:[email protected]] HYDU_create_process (/tmp/mvapich2-1.8.1/src/pm/hydra/utils/launch/launch.c:69): execvp error on file /state/partition1/74127.beach.colorado.edu/myMODEL.a (No such file or directory)
 [proxy:0:[email protected]] HYDU_create_process (/tmp/mvapich2-1.8.1/src/pm/hydra/utils/launch/launch.c:69): execvp error on file /state/partition1/74127.beach.colorado.edu/myMODEL.a (No such file or directory)
 [proxy:0:[email protected]] HYDU_create_process (/tmp/mvapich2-1.8.1/src/pm/hydra/utils/launch/launch.c:69): execvp error on file /state/partition1/74127.beach.colorado.edu/myMODEL.a (No such file or directory)
 [proxy:0:[email protected]] HYDU_create_process (/tmp/mvapich2-1.8.1/src/pm/hydra/utils/launch/launch.c:69): execvp error on file /state/partition1/74127.beach.colorado.edu/myMODEL.a (No such file or directory)
 [proxy:0:[email protected]] HYD_pmcd_pmip_control_cmd_cb (/tmp/mvapich2-1.8.1/src/pm/hydra/pm/pmiserv/pmip_cb.c:955): assert (!closed) failed
 [proxy:0:[email protected]] HYDT_dmxu_poll_wait_for_event (/tmp/mvapich2-1.8.1/src/pm/hydra/tools/demux/demux_poll.c:77): callback returned error status
 [proxy:0:[email protected]] main (/tmp/mvapich2-1.8.1/src/pm/hydra/pm/pmiserv/pmip.c:226): demux engine error waiting for event
 [[email protected]] HYDT_bscu_wait_for_completion (/tmp/mvapich2-1.8.1/src/pm/hydra/tools/bootstrap/utils/bscu_wait.c:70): one of the processes terminated badly; aborting
 [[email protected]] HYDT_bsci_wait_for_completion (/tmp/mvapich2-1.8.1/src/pm/hydra/tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
 [[email protected]] HYD_pmci_wait_for_completion (/tmp/mvapich2-1.8.1/src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:191): launcher returned error waiting for completion
 [[email protected]] main (/tmp/mvapich2-1.8.1/src/pm/hydra/ui/mpich/mpiexec.c:405): process manager error waiting for completion

What am I doing wrong?

Upvotes: 0

Views: 905

Answers (1)

chuck

Reputation: 745

It looks like when MVAPICH starts the processes on the second node, it cannot find your executable. The cp in your script runs only on the node where the batch script executes (the first node), so the binary never reaches the second node's local scratch disk. Try adding the following before your mpiexec to copy your executable, and anything else you need, to each node's scratch directory. I'm not a csh user, so you may be able to do this more cleanly.

# Copy the executable to the per-job scratch directory on every
# node in the job; $TMPDIR resolves to the same path on each node
foreach n ( `uniq $PBS_NODEFILE` )
    scp $PBS_O_WORKDIR/myMODEL.a $n:$TMPDIR
end
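
Integrated into the question's multi-node script, the launch section might then read as follows (a sketch under the same assumptions, reusing the question's own variables):

cd $TMPDIR
# Stage the executable onto every node's local scratch disk before launch
foreach n ( `uniq $PBS_NODEFILE` )
    scp $PBS_O_WORKDIR/myMODEL.a $n:$TMPDIR
end
mpiexec -n $NCPU -f $PBS_NODEFILE ./myMODEL.a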

Upvotes: 3
