pmdaly

Reputation: 1212

How to execute thousands of commands in parallel using xargs?

I'm queuing up a bunch of jobs through qsub in a loop currently

for fn in $FNS; do
    queue_job $(options_a $fn) $(options_b $fn)
done

queue_job is a script that queues jobs via qsub, and options_a/options_b are functions I wrote that add a few job options based on the filename. I queue up to 5k jobs this way, and I'd like to add them all to the queue at once (or in larger blocks, such as 40 at a time) instead of one at a time in a loop.

I know I can send lines to xargs and execute them in parallel as

??? | xargs -P 40 -I{} command {}

but I'm not sure how to translate my for loop to xargs

Upvotes: 3

Views: 1051

Answers (3)

Ole Tange

Reputation: 33685

Using GNU Parallel it looks like this:

export -f options_a
export -f options_b

parallel -j40 'queue_job $(options_a {}) $(options_b {})' ::: $FNS

Upvotes: 0

jxh

Reputation: 70392

xargs is not needed.

If you background the task, the next task can be taken up immediately. You can add some intelligence to your script so that it caps itself on the number of simultaneous tasks. For example:

COUNT=1
LIMIT=40
for fn in $FNS; do
    queue_job $(options_a $fn) $(options_b $fn) &
    if [ $COUNT -lt $LIMIT ] ; then
        COUNT=$((COUNT+1))
        continue
    fi
    wait -n
done
wait

The queue_job command is placed in the background. The if body keeps spawning parallel queue_job tasks until COUNT reaches LIMIT; once it does, the loop waits for one of the running tasks to complete before spawning the next. The trailing wait blocks until all the tasks have completed. (Note that wait -n requires bash 4.3 or later.)

I tested this by simulating queue_job with a 2 second sleep, 30 tasks, and limiting to 10 parallel tasks. As expected, the simulation completed after about 6 seconds.
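The simulation described above can be reproduced with a short sketch (the 2-second sleep standing in for queue_job is illustrative; the real script would call qsub):

```shell
#!/bin/bash
# Simulation sketch: queue_job replaced by a 2-second sleep,
# 30 tasks, capped at 10 running in parallel.
queue_job() { sleep 2; }

COUNT=1
LIMIT=10
start=$SECONDS
for i in $(seq 1 30); do
    queue_job &
    if [ $COUNT -lt $LIMIT ]; then
        COUNT=$((COUNT+1))
        continue
    fi
    # Cap reached: wait for any one background task to finish (bash 4.3+)
    wait -n
done
wait
echo "elapsed: $((SECONDS - start))s"
```

With 30 two-second tasks running 10 at a time, the loop completes in three waves, so the elapsed time should be about 6 seconds.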


Upvotes: 0

dash-o

Reputation: 14442

The qsub interface submits one job at a time; it does not provide bulk submission. This limits the upside of submitting jobs in parallel, since job submission itself is usually fast.

In this specific case, there are two bash functions (options_a and options_b) that expand to job-specific parameters based on the filename. This complicates direct execution with xargs, as noted in the comments: they are shell functions, not executables, so they are not visible to the commands xargs spawns.

Options:

Create a wrapper for queue_job that sources (or includes) the functions, and invoke the wrapper from xargs:

xargs -P40 -I{} queue_job_x1 '{}'

queue_job_x1:

#! /bin/bash
function options_a {
   ...
}

function options_b {
   ...
}

queue_job $(options_a $1) $(options_b $1)

It might be a good idea to put the relevant functions into a .sh file that can be sourced by multiple scripts.
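If the shell is bash, an alternative to a separate wrapper script is to export the functions with export -f and run them through bash -c from xargs. A minimal sketch (the function bodies here are dummies for illustration; queue_job just echoes what it would submit):

```shell
#!/bin/bash
# Stand-ins for the real functions/commands (illustrative only).
options_a() { echo "-a $1"; }
options_b() { echo "-b $1"; }
queue_job() { echo "qsub $*"; }

# Exported functions are visible to the bash -c children xargs spawns.
export -f options_a options_b queue_job

FNS="file1 file2 file3"
printf '%s\n' $FNS |
    xargs -P 40 -I{} bash -c 'queue_job $(options_a "$1") $(options_b "$1")' _ {}
```

The trailing `_ {}` passes the filename as `$1` to the bash -c snippet (`_` fills the `$0` slot). With -P 40 the output order is not guaranteed.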

Upvotes: 2
