Reputation: 73
I have been trying to use multiple nodes in my PBS script to run several independent jobs. Each individual job is supposed to use 8 cores and each node in the cluster has 32 cores. So, I would like to have each node run 4 jobs. My PBS script is as follows.
#!/usr/bin/env bash
#PBS -l nodes=2:ppn=32
#PBS -l mem=128gb
#PBS -l walltime=01:00:00
#PBS -j oe
#PBS -V
#PBS -l gres=ccm
sort -u $PBS_NODEFILE > nodelist.dat
#cat ${PBS_NODEFILE} > nodelist.dat
export JOBS_PER_NODE=4
PARALLEL="parallel -j $JOBS_PER_NODE --sshloginfile nodelist.dat --wd $PBS_O_WORKDIR"
$PARALLEL -a input_files.dat sh test.sh {}
input_files.dat contains the names of the job files. I have successfully used this script to run parallel jobs on one node (in which case I remove --sshloginfile nodelist.dat and sort -u $PBS_NODEFILE > nodelist.dat from the script). However, whenever I try to run it on more than one node, I get the following error.
ssh: connect to host 922 port 22: Invalid argument
ssh: connect to host 901 port 22: Invalid argument
ssh: connect to host 922 port 22: Invalid argument
ssh: connect to host 901 port 22: Invalid argument
Here, 922 and 901 are the numbers corresponding to the assigned nodes, as listed in the nodelist.dat ($PBS_NODEFILE) file.
I searched for this problem but couldn't find much, as everyone else seems to be doing fine with the --sshloginfile argument, so I am not sure whether this is a system-specific problem.
Edit:
As @Ole Tange mentioned in his answer and comments, I need to modify the "node number" as produced by $PBS_NODEFILE, which I am doing in the following way inside the PBS script.
# provides a unique number (say, 900) associated with the node.
sort -u $PBS_NODEFILE > nodelist.dat
# changes the contents of nodelist.dat from "900" to "username@w-900.cluster.uni.edu"
sed -i -r "s/([0-9]+)/username@w-\1.cluster.uni.edu/g" nodelist.dat
I verified that nodelist.dat contains only one line, viz. username@w-900.cluster.uni.edu.
Edit-2:
It seems like the cluster's architecture is responsible for the error I am getting. I ran the same script on a different cluster (say, cluster_2), and it finished without any errors. In my sysadmin's words, the reason why it works on cluster_2 is: "cluster_2 is a single machine. Once your job starts, you are actually on the head node of your PBS job like you would expect."
Upvotes: 2
Views: 956
Reputation: 33685
The variable $PARALLEL is used by GNU Parallel for options, so using it for your own purposes is likely to cause confusion. It does not seem to be the root cause here, but do yourself a favor and use another variable name (or use it as described in the man page).
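As a rough sketch of the renaming suggested above, the command prefix can live in a differently named variable. The name job_runner is hypothetical, and the fallback to $PWD is only there so the snippet can be tried outside a PBS job; the `echo` prints the command line instead of running it.

```shell
#!/usr/bin/env bash
# Sketch only: job_runner is an illustrative name, not something GNU Parallel defines.
JOBS_PER_NODE=4

# PBS sets PBS_O_WORKDIR inside a job; fall back to the current directory elsewhere.
workdir="${PBS_O_WORKDIR:-$PWD}"

# Same options as in the question, stored under a name GNU Parallel does not read.
job_runner="parallel -j $JOBS_PER_NODE --sshloginfile nodelist.dat --wd $workdir"

# Print the full command line; remove 'echo' to actually launch the jobs.
echo $job_runner -a input_files.dat sh test.sh {}
```

Like the original script, this relies on the shell splitting the unquoted variable into words, so it assumes the working directory path contains no spaces.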
The problem here seems to be ssh, which will not see a number as a hostname:
$ ssh 8
ssh: connect to host 8 port 22: Invalid argument
Add the domain name, and ssh will see it as a hostname:
$ ssh 8.pi.dk
<<connects>>
If I were you I would talk to your cluster admin and ask if the worker nodes could be renamed to w-XXX, where XXX is their current name.
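Until the nodes are renamed, a quick check along the following lines could flag bare numeric entries in the login file before they reach ssh. This is only a sketch: it writes a sample nodelist.dat containing the two node numbers from the question, and the variable name bad_hosts is hypothetical.

```shell
#!/usr/bin/env bash
# Sample login file with the bare node numbers from the error messages.
printf '922\n901\n' > nodelist.dat

# Collect entries that consist only of digits; per the ssh behaviour shown above,
# these are not accepted as hostnames.
bad_hosts=$(grep -E '^[0-9]+$' nodelist.dat)

if [ -n "$bad_hosts" ]; then
    # Unquoted on purpose so the entries are printed on a single line.
    echo "numeric-only entries in nodelist.dat:" $bad_hosts
fi
```

In a real job the printf line would be replaced by the sort -u $PBS_NODEFILE > nodelist.dat step from the question.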
Upvotes: 1