Reputation: 2044

having a job run only after all my previous jobs have finished

I found a post describing how to tell bsub to wait for a specified set of jobs to finish before running; however, this only works if one knows the number of jobs beforehand.

I would like to run an arbitrary number of jobs, and then run a "wrapping up" job after all of them have finished.

Here is my script:

#!/bin/bash
for file in dir/*; do # I don't know how many jobs will be created
    bsub "./do_it_once.sh $file"
done

bsub -w "done(1) && done(2) && done(3)" merge_results.sh

The above script works when exactly 3 jobs are submitted, but what if there are n jobs? How can I specify that I want to wait for all the jobs to finish?

Upvotes: 2

Answers (4)

Shynamo

Reputation: 1

Since the output of bjobs is 1 line (No unfinished job found) when no job is pending or running, and at least 2 lines when at least 1 job is pending or running:

JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
25156   awesome RUN   best_queue superhost   30*host     cool_name  Jun 16 05:38

You can loop on bjobs | wc -l using:

for job in $some_jobs; do
    bsub < "$job"

    # Waiting for jobs to complete
    while [[ $(bjobs | wc -l) -ge 2 ]]; do
        sleep 15
    done
done

One benefit of this technique is that you can launch multiple jobs regardless of how many you need to run: just submit them all in a loop before waiting. This is clearly not the cleanest way to do it, but it works for now.

for some_jobs in $job_groups; do
    for job in $some_jobs; do
        bsub < "$job"
    done

    # Waiting for jobs to complete
    while [[ $(bjobs | wc -l) -ge 2 ]]; do
        sleep 15
    done
done
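The polling loop above can be wrapped in a small reusable function. This is only a sketch: the function name and the poll_interval parameter are made up for illustration, and it assumes the default bjobs layout of one header line followed by one line per unfinished job. Discarding stderr and skipping the header means it does not matter whether the "No unfinished job found" message is counted as a line.

```shell
#!/bin/bash
# Hypothetical helper (not part of LSF): block until bjobs lists no
# pending or running jobs for the current user.
wait_for_no_unfinished_jobs() {
    local poll_interval="${1:-15}"   # seconds between checks (illustrative default)
    local remaining
    while true; do
        # Drop the header line; what is left is one line per unfinished job.
        remaining=$(( $(bjobs 2>/dev/null | tail -n +2 | wc -l) ))
        [ "$remaining" -eq 0 ] && break
        sleep "$poll_interval"
    done
}

# Usage sketch (requires LSF, so it is left commented out):
# for job in $some_jobs; do bsub < "$job"; done
# wait_for_no_unfinished_jobs 15
```

Note that this waits for every unfinished job of the user, not only the ones submitted by the script, which is the same limitation as the loop above.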

Upvotes: 0

kmace

Reputation: 2044

Based on cxw's reply, I got something working. It doesn't use arrays. However, the -w option accepts wildcards in job names, so I gave each job a similar name.

I'm still not sure if this is the correct way to call bsub, since you need to call it once per job, but it works.

Edited from cxw's script:

#!/bin/bash
jobnum=0
for file in src/*; do # I don't know how many jobs will be created
    bsub -J "myjobs${jobnum}" "./do_it_once.sh $file"
    jobnum=$((jobnum + 1))
done

bsub -w "done(myjobs*)" merge_results.sh
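One thing to watch with the wildcard is that it matches any of your jobs whose name starts with the same prefix, including jobs from another run of the script. A possible variation, sketched below, gives each run its own prefix and skips the merge job when nothing was submitted. The function name and the prefix scheme (built from the script's PID) are illustrative choices, not something bsub requires.

```shell
#!/bin/bash
# Sketch: tag every job of this run with a shared, run-specific name prefix,
# then make the merge job depend on the wildcard.
# submit_then_merge and the "myjobs_<PID>_" prefix are hypothetical names.
submit_then_merge() {
    local input_dir="$1"
    local prefix="myjobs_$$_"     # $$ is the PID of this script
    local jobnum=0
    local file
    for file in "$input_dir"/*; do
        [ -e "$file" ] || continue   # empty directory: the glob stays literal
        bsub -J "${prefix}${jobnum}" "./do_it_once.sh $file"
        jobnum=$((jobnum + 1))
    done
    if [ "$jobnum" -eq 0 ]; then
        echo "no input files in $input_dir, nothing submitted" >&2
        return 1
    fi
    # done(...) is satisfied only when every matching job finished successfully.
    bsub -w "done(${prefix}*)" merge_results.sh
}

# Usage sketch (requires LSF, so it is left commented out):
# submit_then_merge src
```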

Upvotes: 1

Orsiris de Jong

Reputation: 3016

Here's my full solution, which adds time control and reports the number of failed jobs. It also kills the children of failed jobs if needed, and deals with zombie or uninterruptible processes:

function Logger {
    echo "$1"
}

# Portable child (and grandchild) kill function tested under Linux, BSD and MacOS X
function KillChilds {
    local pid="${1}" # Parent pid whose children should be killed
    local self="${2:-false}" # Should parent be killed too ?
    local children child

    # Recursively kill all descendants first
    if children="$(pgrep -P "$pid")"; then
        for child in $children; do
            KillChilds "$child" true
        done
    fi
    # Try to kill nicely, if not, wait 15 seconds to let Trap actions happen before killing
    if ( [ "$self" == true ] && kill -0 $pid > /dev/null 2>&1); then
        kill -s TERM "$pid"
        if [ $? != 0 ]; then
            sleep 15
            Logger "Sending SIGTERM to process [$pid] failed."
            kill -9 "$pid"
            if [ $? != 0 ]; then
                Logger "Sending SIGKILL to process [$pid] failed."
                return 1
            fi
        else
            return 0
        fi
    else
        return 0
    fi
}

function WaitForTaskCompletion {
    local pids="${1}" # pids to wait for, separated by semicolons
    local soft_max_time="${2}" # If program with pid $pid takes longer than $soft_max_time seconds, will log a warning, unless $soft_max_time equals 0.
    local hard_max_time="${3}" # If program with pid $pid takes longer than $hard_max_time seconds, will stop execution, unless $hard_max_time equals 0.
    local caller_name="${4}" # Who called this function
    local counting="${5:-true}" # Count time since function has been launched if true, since script has been launched if false
    local keep_logging="${6:-0}" # Log a standby message every X seconds. Set to zero to disable logging

    local soft_alert=false # Does a soft alert need to be triggered, if yes, send an alert once
    local log_ttime=0 # local time instance for comparison

    local seconds_begin=$SECONDS # Seconds since the beginning of the script
    local exec_time=0 # Seconds since the beginning of this function

    local retval=0 # return value of monitored pid process
    local errorcount=0 # Number of pids that finished with errors

    local pid   # Current pid working on
    local pidCount # number of given pids
    local pidState # State of the process

    local pidsArray # Array of currently running pids
    local newPidsArray # New array of currently running pids

    IFS=';' read -a pidsArray <<< "$pids"
    pidCount=${#pidsArray[@]}

    WAIT_FOR_TASK_COMPLETION=""

    while [ ${#pidsArray[@]} -gt 0 ]; do
        newPidsArray=()

        Spinner
        if [ $counting == true ]; then
            exec_time=$(($SECONDS - $seconds_begin))
        else
            exec_time=$SECONDS
        fi

        if [ $keep_logging -ne 0 ]; then
            if [ $((($exec_time + 1) % $keep_logging)) -eq 0 ]; then
                if [ $log_ttime -ne $exec_time ]; then # Fix when sleep time lower than 1s
                    log_ttime=$exec_time
                    Logger "Current tasks still running with pids [${pidsArray[*]}]."
                fi
            fi
        fi

        if [ $exec_time -gt $soft_max_time ]; then
            # Only alert once: soft_alert is false until the first warning has been sent
            if [ $soft_alert == false ] && [ $soft_max_time -ne 0 ]; then
                Logger "Max soft execution time exceeded for task [$caller_name] with pids [$(joinString , ${pidsArray[@]})]."
                soft_alert=true
                SendAlert true
            fi
            if [ $exec_time -gt $hard_max_time ] && [ $hard_max_time -ne 0 ]; then
                Logger "Max hard execution time exceeded for task [$caller_name] with pids [$(joinString , ${pidsArray[@]})]. Stopping task execution."
                for pid in "${pidsArray[@]}"; do
                    KillChilds $pid true
                    if [ $? == 0 ]; then
                        Logger "Task with pid [$pid] stopped successfully." "NOTICE"
                    else
                        Logger "Could not stop task with pid [$pid]." "ERROR"
                    fi
                done
                SendAlert true
                errorcount=$((errorcount+1))
            fi
        fi

        for pid in "${pidsArray[@]}"; do
            if [ $(IsNumeric $pid) -eq 1 ]; then
                if kill -0 $pid > /dev/null 2>&1; then
                    # Handle uninterruptible sleep state or zombies by omitting them from the running process array (how to kill what is already dead? :)
                    #TODO(high): have this tested on *BSD, Mac & Win
                    pidState=$(ps -p$pid -o state= 2> /dev/null)
                    if [ "$pidState" != "D" ] && [ "$pidState" != "Z" ]; then
                        newPidsArray+=($pid)
                    fi
                else
                    # pid is dead, get its exit code from the wait command
                    wait $pid
                    retval=$?
                    if [ $retval -ne 0 ]; then
                        errorcount=$((errorcount+1))
                        Logger "${FUNCNAME[0]} called by [$caller_name] finished monitoring [$pid] with exitcode [$retval]." "DEBUG"
                        if [ "$WAIT_FOR_TASK_COMPLETION" == "" ]; then
                            WAIT_FOR_TASK_COMPLETION="$pid:$retval"
                        else
                            WAIT_FOR_TASK_COMPLETION="$WAIT_FOR_TASK_COMPLETION;$pid:$retval"
                        fi
                    fi
                fi

            fi
        done

        pidsArray=("${newPidsArray[@]}")
        # Trivial wait time for bash to not eat up all CPU
        sleep .05
    done

    # Return exit code if only one process was monitored, else return number of errors
    if [ $pidCount -eq 1 ]; then
        return $retval
    else
        return $errorcount
    fi
}
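
WaitForTaskCompletion calls Spinner, SendAlert, IsNumeric and joinString, none of which are defined in this answer. To try the function on its own, minimal stand-ins along these lines should be enough. They are assumptions inferred from how the helpers are called above (for example, IsNumeric has to echo 1 or 0), not the author's originals.

```shell
#!/bin/bash
# Hypothetical stand-ins for helpers the answer calls but does not define.

# Progress indicator; a no-op here.
Spinner() { :; }

# Alert hook; this version only writes a note to stderr.
SendAlert() { echo "ALERT requested (args: $*)" >&2; }

# Echo 1 if the argument is a non-negative integer, 0 otherwise.
IsNumeric() {
    if [[ "$1" =~ ^[0-9]+$ ]]; then
        echo 1
    else
        echo 0
    fi
}

# Join arguments 2..n, using argument 1 as the separator character.
joinString() {
    local IFS="$1"
    shift
    echo "$*"
}
```

Define these before calling WaitForTaskCompletion.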

Usage:

Let's take 3 sleep jobs, get their pids and send them to WaitForTaskCompletion:

sleep 10 &
pids="$!"
sleep 15 &
pids="$pids;$!"
sleep 20 &
pids="$pids;$!"

WaitForTaskCompletion "$pids" 1800 3600 "${FUNCNAME[0]}" false 1800

The example above would warn you if execution takes more than 30 minutes (1800 seconds), stop execution after 1 hour (3600 seconds), and send an "alive" log message every half hour.

Upvotes: 0

cxw

Reputation: 17051

Edit: See kamula's answer for what actually works :)

Original answer

Never used bsub, but from a quick trip through the man page, I think this might do it:

#!/bin/bash
jobnum=0
for file in src/*; do # I don't know how many jobs will be created
    bsub -J "myjobs[$jobnum]" "./do_it_once.sh $file"
    jobnum=$((jobnum + 1))
done

bsub -w "done(myjobs[*])" merge_results.sh

The jobs are named with sequential indices in a bsub array called myjobs[], using bash variable jobnum. Then the last bsub waits for all of the myjobs[] jobs to finish.

YMMV!

Oh - also, you might need to use -J "\"myjobs[...]\"" (with \"). The man page says to wrap the job names in double-quotes, but I don't know if that's a bsub requirement or if they are assuming you will be using a shell that expands unquoted text.
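
An alternative that avoids job names altogether is to generate the explicit done(1) && done(2) && ... expression from the question for however many jobs were submitted. The sketch below assumes bsub prints its usual confirmation line of the form Job <ID> is submitted to queue <name>. on stdout. Both function names are made up for illustration, and I haven't checked how long a dependency expression LSF accepts, so this may not scale to very many jobs.

```shell
#!/bin/bash
# Hypothetical helpers, not part of LSF.

# Extract the numeric ID from a line such as
# "Job <12345> is submitted to queue <normal>."
job_id_from_bsub_output() {
    sed -n 's/^Job <\([0-9][0-9]*\)>.*/\1/p' <<< "$1"
}

# Turn a list of job IDs into "done(1) && done(2) && ...".
build_done_expr() {
    local expr="" id
    for id in "$@"; do
        expr+="${expr:+ && }done($id)"
    done
    echo "$expr"
}

# Usage sketch (requires LSF, so it is left commented out):
# ids=()
# for file in src/*; do
#     out=$(bsub "./do_it_once.sh $file")
#     ids+=("$(job_id_from_bsub_output "$out")")
# done
# bsub -w "$(build_done_expr "${ids[@]}")" merge_results.sh
```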

Upvotes: 1
