Jack Walpole

Reputation: 31

How to know when PBS batch jobs are complete

I have a Bash script that submits multiple serial jobs to the PBS queueing system. Once the jobs are submitted, the script exits. The jobs then run on a cluster, and when they have all finished I can move on to the next step. A typical workflow might involve several such steps.

My question:

Is there a way for my script not to exit upon completion of the submission, but rather to sleep until ALL jobs submitted by that script have completed on the cluster, only then exiting?

Upvotes: 3

Views: 2168

Answers (3)

MountainDrew

Reputation: 473

To actually check whether a job is done, we need to pass the job ID to qstat to get the job's status and then grep that output for the status code. As long as neither your username nor your job name is "C", the following should work:

#!/bin/bash

# SECTION 1: Launch all jobs and store their job IDs in a variable

myJobs="job1.qsub job2.qsub job3.qsub" # Your job names here
numJobs=$(echo "$myJobs" | wc -w)      # Count the jobs
myJobIDs=""                            # Initialize an empty list of job IDs
for job in $myJobs; do
    jobID_full=$(qsub "$job")
    # jobID_full will look like "12345.machinename", so use sed
    # to get just the numbers
    jobID=$(echo "$jobID_full" | sed -e 's|\([0-9]*\).*|\1|')
    myJobIDs="$myJobIDs $jobID"        # Add this job ID to our list
done

# SECTION 2: Check the status of each job, and exit while loop only
# if they are all complete

numDone=0                              # Initialize so that the loop starts
while [ "$numDone" -lt "$numJobs" ]; do
    numDone=0                          # Zero it, since we re-count each pass
    for jobID in $myJobIDs; do         # Loop through each job ID

        # The following if-statement ONLY works if qstat won't return
        # the string ' C ' (a C surrounded by two spaces) in any
        # situation besides a completed job.  I.e. if your username
        # or job name is 'C' then this won't work!
        # Could add a check for errors (grep -q ' E ') too if desired
        if qstat "$jobID" | grep -q ' C '; then
            (( numDone++ ))
        fi
    done
    echo "$numDone jobs completed out of $numJobs"
    sleep 1                            # Sleep once per pass, not once per job
done

echo all jobs complete
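A more robust variant is to parse the status column itself, which is the fifth field of qstat's default display, rather than grepping for ' C ' anywhere in the line; then a user or job named "C" can't cause a false match. A sketch, with a here-doc standing in for the output of a real `qstat "$jobID"` call:

```shell
#!/bin/bash
# Sketch: extract qstat's status column (field 5) directly. The
# here-doc below stands in for real `qstat "$jobID"` output; note
# the username is "C" and the parse is still correct.
qstat_output=$(cat <<'EOF'
Job ID          Name      User    Time Use S Queue
--------------- --------- ------- -------- - -----
12345.machine   job1      C       00:01:02 C batch
EOF
)
# Skip the two header lines, then print the 5th column
status=$(echo "$qstat_output" | awk 'NR > 2 { print $5 }')
echo "$status"   # prints: C
```

The completion test in the loop above would then become `[ "$status" = "C" ]`.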

Upvotes: 0

dbeer

Reputation: 7213

You are trying to establish a workflow, correct? The best way to accomplish this is with job dependencies: you submit X jobs, then submit further jobs that won't start until the first set has finished. There are different dependency types that you can read about in the previous link, but here's an example of submitting 3 jobs and then submitting 3 more that won't execute until after the first 3 have exited.

#first batch
jobid1=`qsub ...`
jobid2=`qsub ...`
jobid3=`qsub ...`

#next batch
depend_str="-W depend=afterany:${jobid1}:${jobid2}:${jobid3}"
qsub ... $depend_str
qsub ... $depend_str
qsub ... $depend_str
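This generalizes to any number of jobs: collect the IDs that qsub prints into an array and join them with ':' into a single dependency list. A minimal sketch, where the IDs are placeholders for real qsub output (in practice you would fill the array with `jobids+=("$(qsub ...)")`):

```shell
#!/bin/bash
# Placeholder IDs standing in for what qsub would print
jobids=(12345.machinename 12346.machinename 12347.machinename)

# Join the IDs with ':' to build one dependency list
depend=$(IFS=:; echo "${jobids[*]}")
echo "-W depend=afterany:${depend}"
# prints: -W depend=afterany:12345.machinename:12346.machinename:12347.machinename
```

The follow-up step would then be submitted as `qsub -W "depend=afterany:${depend}" your_next_script.qsub` (script name illustrative).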

Upvotes: 1

264nm

Reputation: 755

One way to do this would be to use 'sem' from GNU Parallel.

I learnt about this while doing queue work as well. sem acts as a counting semaphore: it queues commands up to a limit and lets you block until they have all finished before moving on.

Edit: I know the example here is really basic, but there is a lot that can be achieved by running tasks with sem (an alias for parallel --semaphore) or even just parallel in general. Have a look at the tutorial; I'm certain you will find a relevant example that will help.

There is a great tutorial here

An example from the tutorial:

  sem 'sleep 1; echo The first finished' &&
    echo The first is now running in the background &&
    sem 'sleep 1; echo The second finished' &&
    echo The second is now running in the background
  sem --wait

Output:

The first is now running in the background

The first finished

The second is now running in the background

The second finished

See Man Page
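If GNU Parallel is not available, the same wait-for-all pattern can be approximated in plain Bash with background subshells and the built-in `wait`. A sketch only: this tracks local processes, so it cannot follow jobs handed off to the cluster via qsub.

```shell
#!/bin/bash
# Sketch: run tasks in the background, then block until all finish.
# Only local child processes are tracked, not PBS jobs.
for i in 1 2 3; do
    ( sleep 0.1; echo "task $i finished" ) &
done
wait    # blocks until every background child has exited
echo all tasks complete
```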

Upvotes: 0
