Reputation: 2052
I am doing research, and I often need to execute the same program with different inputs (each combination of inputs repeatedly) and store the results, for aggregation.
I would like to speed up the process by executing these experiments in parallel over multiple machines. However, I would like to avoid the hassle of launching them manually. Furthermore, I would like to keep my program implemented as a single thread and only add parallelization on top of it.
I work with Ubuntu machines, all reachable on the local network. I know GNU Parallel can solve this problem, but I am not familiar with it. Can someone help me set up an environment for my experiments?
Upvotes: 0
Views: 129
Reputation: 2052
Please notice that this answer has been adapted from one of my scripts and is untested. If you find bugs, you are welcome to edit the answer.
First of all, to make the process completely batch, we need a non-interactive SSH login (that's what GNU Parallel uses to launch commands remotely).
To do this, first generate a pair of RSA keys (if you don't already have one) with:
ssh-keygen -t rsa
which will generate a pair of private and public keys, stored by default in ~/.ssh/id_rsa and ~/.ssh/id_rsa.pub. It is important to use these locations, as openssh will look for them there. While openssh commands allow you to specify the private key file (passing it with -i PRIVATE_KEY_FILE_PATH), GNU Parallel does not have such an option.
Next, we need to copy the public key to all the remote machines we are going to use. For each of the machines of your cluster (I will call them "workers"), run this command on your local machine:
ssh-copy-id -i ~/.ssh/id_rsa.pub WORKER_USER@WORKER_HOST
This step is interactive, as you will need to log in to each of the workers with a user id and password.
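You can quickly verify that the non-interactive login works: the following should print the worker's hostname without asking for a password (same placeholders as above):
ssh WORKER_USER@WORKER_HOST hostname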
From this moment on, logging in from your client to each of the workers is non-interactive. Next, let's set up a bash variable with a comma-separated list of your workers. We will use GNU Parallel's special syntax, which lets you indicate how many CPUs to use on each worker:
WORKERS_PARALLEL="2/user@192.168.0.10,user@192.168.0.20,4/user@10.0.111.69"
Here, I specified that on 192.168.0.10 I want only 2 parallel processes, while on 10.0.111.69 I want four. As for 192.168.0.20, since I did not specify any number, GNU Parallel will figure out how many CPUs (CPU cores, actually) the remote machine has and run that many parallel processes.
Since I will also need the same list in a format that openssh can understand, I will create a second variable without the CPU information and with spaces instead of commas. I do this automatically with:
WORKERS=`echo $WORKERS_PARALLEL | sed 's/[0-9]*\///g' | sed 's/,/ /g'`
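As a sanity check, with the example list above, $WORKERS now contains:
user@192.168.0.10 user@192.168.0.20 user@10.0.111.69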
Now it's time to set up my code. I assume that each of the workers is configured to run my code, so that I will just need to copy it. On the workers, I usually work in the /tmp folder, so what follows assumes that. The code will be copied through an SSH tunnel and extracted remotely:
WORKING_DIR=/tmp/myexperiments
TAR_PATH=/tmp/code.tar.gz
# Clean from previous executions
parallel --nonall -S $WORKERS rm -rf $WORKING_DIR $TAR_PATH
# Copy the tar.gz file to the worker
parallel scp LOCAL_TAR_PATH {}:/tmp ::: `echo $WORKERS`
# Create the working directory on the worker
parallel --nonall -S $WORKERS mkdir -p $WORKING_DIR
# Extract the tar file in the working directory
parallel --nonall -S $WORKERS tar --warning=no-timestamp -xzf $TAR_PATH -C $WORKING_DIR
Notice that multiple executions on the same machine will use the same working directory. I assume only one version of the code will be run at a specific time; if this is not the case you will need to modify the commands to use different working directories.
I use the --warning=no-timestamp option to avoid the annoying warnings that could be issued if the clock of your local machine is ahead of that of your workers.
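For completeness, the tar.gz archive has to be created locally before running the copy commands above. A minimal sketch, assuming your code lives in ~/myexperiment-code (a placeholder for your actual source directory):
# Package the code locally; use the resulting file as LOCAL_TAR_PATH in the scp command above
tar -czf /tmp/code.tar.gz -C ~/myexperiment-code .
Since the file is named code.tar.gz, after the scp it will end up at /tmp/code.tar.gz on each worker, matching TAR_PATH.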
We now need to create directories on the local machine for storing the results of the runs, one for each group of experiments (that is, multiple executions with the same parameters). Here, I am using two dummy parameters, alpha and beta:
GROUP_DIRS="results/alpha=1,beta=1 results/alpha=0.5,beta=1 results/alpha=0.2,beta=0.5"
N_GROUPS=3
parallel --header : mkdir -p {DIR} ::: DIR $GROUP_DIRS
Notice that using parallel here is not necessary: a loop would have worked, but I find this more readable. I also stored the number of groups, which we will use in the next step.
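For reference, the plain-bash equivalent of the command above would be:
for DIR in $GROUP_DIRS; do mkdir -p "$DIR"; done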
A final preparation step consists in creating a list of all the combinations of parameters that will be used in the experiments, each repeated as many times as necessary. Each repetition is coupled with an incremental number for identifying different runs.
ALPHAS="1.0 0.5 0.2"
BETAS="1.0 1.0 0.5"
REPETITIONS=1000
PARAMS_FILE=/tmp/params.txt
# Create header
echo REP GROUP_DIR ALPHA BETA > $PARAMS_FILE
# Populate
parallel \
--header : \
--xapply \
if [ ! -e {GROUP_DIR}"/exp"{REP}".dat" ]';' then echo {REP} {GROUP_DIR} {ALPHA} {BETA} '>>' $PARAMS_FILE ';' fi \
::: REP $(for i in `seq $REPETITIONS`; do printf $i" %.0s" $(seq $N_GROUPS) ; done) \
::: GROUP_DIR $GROUP_DIRS \
::: ALPHA $ALPHAS \
::: BETA $BETAS
In this step I also implemented a control: if a .dat file already exists, I skip that set of parameters. This is something that comes out of practice: I often interrupt the execution of GNU Parallel and later decide to resume it by re-executing these commands. With this simple control I avoid running more experiments than necessary.
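With the example values above, the resulting $PARAMS_FILE will look roughly like this (the exact order of the lines may vary, since they are appended in parallel):
REP GROUP_DIR ALPHA BETA
1 results/alpha=1,beta=1 1.0 1.0
1 results/alpha=0.5,beta=1 0.5 1.0
1 results/alpha=0.2,beta=0.5 0.2 0.5
2 results/alpha=1,beta=1 1.0 1.0
...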
Now we can finally run the experiments. The algorithm in this example generates a file as specified in the parameter --save-data, which I want to retrieve. I also want to save the stdout and stderr in a file, for debugging purposes.
cat $PARAMS_FILE | parallel \
--sshlogin $WORKERS_PARALLEL \
--workdir $WORKING_DIR \
--return {GROUP_DIR}"/exp"{REP}".dat" \
--return {GROUP_DIR}"/exp"{REP}".txt" \
--cleanup \
--xapply \
--header 1 \
--colsep " " \
mkdir -p {GROUP_DIR} ';' \
./myExperiment \
--random-seed {REP} \
--alpha {ALPHA} \
--beta {BETA} \
--save-data {GROUP_DIR}"/exp"{REP}".dat" \
'&>' {GROUP_DIR}"/exp"{REP}".txt"
A little bit of explanation about the parameters. --sshlogin, which can be abbreviated to -S, passes the list of workers that Parallel will use to distribute the computational load. --workdir sets the working directory of Parallel, which by default is ~. The --return directives copy the specified files back once the execution is completed. --cleanup removes the transferred and returned files from the workers after they have been copied back. --xapply tells Parallel to interpret the parameters as tuples (rather than as sets to multiply by Cartesian product). --header 1 tells Parallel that the first line of the parameters file has to be interpreted as a header (whose entries will be used as names for the columns). --colsep tells Parallel that the columns in the parameters file are space-separated.
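If you want to check what Parallel is going to execute before actually launching the experiments, you can add --dryrun to the invocation above; Parallel will then print the composed command lines instead of running them. A minimal sketch of the idea (nothing is copied or executed):
# Print the command that would be run for each row of the parameters file
cat $PARAMS_FILE | parallel --dryrun --xapply --header 1 --colsep " " \
./myExperiment --random-seed {REP} --alpha {ALPHA} --beta {BETA}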
WARNING: Ubuntu's version of parallel is outdated (2013). In particular, there is a bug preventing the above code from running properly, which was fixed only a few days ago. To get the latest monthly snapshot, run the following (it does not need root privileges):
(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash
Notice that the fix to the bug I mentioned above will only be included in the next snapshot, on September 22nd, 2015. If you are in a hurry, you should perform a manual installation of the "smoking hottest" (development) version.
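For reference, the general shape of a manual installation from a release tarball is sketched below; the URL is the standard GNU mirror path, and for the specific fix mentioned above you would need the development sources instead:
# Download and build a released version (--prefix avoids needing root)
wget http://ftp.gnu.org/gnu/parallel/parallel-latest.tar.bz2
tar -xjf parallel-latest.tar.bz2
cd parallel-*/
./configure --prefix=$HOME && make && make install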
Finally, it is a good habit to clean our working environments:
rm $PARAMS_FILE
parallel --nonall -S $WORKERS rm -rf $WORKING_DIR $TAR_PATH
If you use this for research and publish a paper, remember to cite the original work by Ole Tange (see parallel --bibtex).
Upvotes: 1