tobiuchiha

Reputation: 73

Unable to run PBS script on multiple nodes using GNU parallel

I have been trying to use multiple nodes in my PBS script to run several independent jobs. Each individual job is supposed to use 8 cores and each node in the cluster has 32 cores. So, I would like to have each node run 4 jobs. My PBS script is as follows.

#!/usr/bin/env bash
#PBS -l nodes=2:ppn=32
#PBS -l mem=128gb
#PBS -l walltime=01:00:00
#PBS -j oe
#PBS -V
#PBS -l gres=ccm

sort -u $PBS_NODEFILE > nodelist.dat
#cat ${PBS_NODEFILE} > nodelist.dat

export JOBS_PER_NODE=4  

PARALLEL="parallel -j $JOBS_PER_NODE --sshloginfile nodelist.dat --wd $PBS_O_WORKDIR"
$PARALLEL -a input_files.dat sh test.sh {}

input_files.dat contains the names of the job files. I have successfully used this script to run parallel jobs on one node (in which case I remove --sshloginfile nodelist.dat and sort -u $PBS_NODEFILE > nodelist.dat from the script). However, whenever I try to run it on more than one node, I get the following error:
ssh: connect to host 922 port 22: Invalid argument
ssh: connect to host 901 port 22: Invalid argument
ssh: connect to host 922 port 22: Invalid argument
ssh: connect to host 901 port 22: Invalid argument
Here, 922 and 901 are the numbers corresponding to the assigned nodes and are included in the nodelist.dat ($PBS_NODEFILE) file.
I searched for this problem but couldn't find much, as everyone else seems to be doing fine with the --sshloginfile argument, so I am not sure whether this is a system-specific problem.

Edit:

As @Ole Tange mentioned in his answer and comments, I need to modify the "node number" as produced by $PBS_NODEFILE, which I am doing in the following way inside the PBS script.

# provides a unique number (say, 900) associated with the node.
sort -u $PBS_NODEFILE > nodelist.dat

# changes the contents of the nodelist.dat from "900" to "username@w-900.cluster.uni.edu"
sed -i -r "s/([0-9]+)/username@w-\1.cluster.uni.edu/g" nodelist.dat

I verified that nodelist.dat contains only one line, viz. username@w-900.cluster.uni.edu.
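To check the substitution when more than one node is assigned, here is a minimal sketch run on made-up node numbers. The file names nodelist_demo.dat and nodelist_fqdn.dat are hypothetical, and username and cluster.uni.edu are the same placeholders as in the sed command above. The pattern is anchored to the whole line so that only bare numbers are rewritten.

```shell
# Made-up node numbers standing in for the output of: sort -u $PBS_NODEFILE
printf '901\n922\n' > nodelist_demo.dat

# Rewrite each bare number into a full ssh login (placeholders as above).
# -E selects extended regular expressions, like -r in the original command.
sed -E 's/^([0-9]+)$/username@w-\1.cluster.uni.edu/' nodelist_demo.dat > nodelist_fqdn.dat

# Show the rewritten list, one login per line
cat nodelist_fqdn.dat
```

With this sample input, the output file holds one login per node: username@w-901.cluster.uni.edu and username@w-922.cluster.uni.edu.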

Edit-2:

It seems like the cluster's architecture is responsible for the error I am getting. I ran the same script on a different cluster (say, cluster_2), and it finished without any errors. In my sysadmin's words, the reason why it works on cluster_2 is: "cluster_2 is a single machine. Once your job starts, you are actually on the head node of your PBS job like you would expect."

Upvotes: 2

Views: 956

Answers (1)

Ole Tange

Reputation: 33685

The variable $PARALLEL is used by GNU Parallel for options, so using it for your own purposes is likely to cause confusion. It does not seem to be the root cause here, but do yourself a favor and use another variable name (or use it as described in the man page).
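A minimal sketch of the rename, assuming the rest of the script stays as in the question. PARALLEL_CMD is a hypothetical name (anything other than PARALLEL would do), and the fallback to the current directory is only there so the snippet also runs outside a PBS job. The command is echoed rather than executed so it can be inspected first.

```shell
export JOBS_PER_NODE=4

# PARALLEL_CMD is a hypothetical name that does not clash with $PARALLEL.
# ${PBS_O_WORKDIR:-$PWD} falls back to the current directory outside PBS.
PARALLEL_CMD="parallel -j $JOBS_PER_NODE --sshloginfile nodelist.dat --wd ${PBS_O_WORKDIR:-$PWD}"

# Print the command that would run; remove "echo" to actually execute it.
echo $PARALLEL_CMD -a input_files.dat sh test.sh {}
```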

The problem here seems to be ssh, which will not treat a bare number as a hostname:

$ ssh 8
ssh: connect to host 8 port 22: Invalid argument

Add the domain name, and ssh will see it as a hostname:

$ ssh 8.pi.dk
<<connects>>

If I were you, I would talk to your cluster admin and ask whether the worker nodes could be renamed to w-XXX, where XXX is their current name.
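Until then, a quick way to spot entries that ssh would reject is to scan the login file for lines made up of digits only. This is just a sketch on made-up data: nodelist_check.dat is a hypothetical file name and the entries are invented.

```shell
# Made-up entries: two bare node numbers and one full ssh login
printf '901\n922\nusername@w-900.cluster.uni.edu\n' > nodelist_check.dat

bare=0
while IFS= read -r host; do
    case "$host" in
        ''|*[!0-9]*) ;;   # empty, or contains a non-digit: not a bare number
        *) echo "bare number: $host"; bare=$((bare + 1)) ;;
    esac
done < nodelist_check.dat

echo "$bare entries need a hostname"
```

For the sample file this flags 901 and 922 and leaves the full login alone.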

Upvotes: 1
