enc

Reputation: 3425

gnu parallel as a job queue

I have sets of jobs, and the jobs within a set can all run in parallel, so I want to parallelize them for better throughput.

This is what I am currently doing: I wrote a Python script using the multiprocessing library that runs all of the jobs in a set at the same time. Only after every job in a set has finished is the next set (script) invoked. This is inefficient because the jobs in a set have different execution times.

Recently, I learned about GNU parallel and I think it may help improve my script. However, each set of jobs has some pre-processing and post-processing tasks, so it is not possible to run the jobs in an arbitrary order.

In summary, I want to 1) make sure that pre-processing is completed before launching a set's jobs, and 2) run post-processing only after all jobs in a set have completed.

And this is what I am trying to do:

  1. Run a separate script for each set of jobs.
  2. Each script runs its set's pre-processing, after which all jobs in the set are free to run.
  3. Each script registers its jobs in a GNU parallel job queue.
  4. GNU parallel runs the queued jobs in parallel.
  5. Each script monitors whether its own jobs are finished.
  6. When all jobs in a set are done, run that set's post-processing.
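The per-set logic in the steps above can be sketched even in plain bash with background jobs and `wait` (GNU parallel then adds the cross-set slot limiting on top); `preprocess`, `do_job` and `postprocess` here are hypothetical stand-ins for your own commands:

```shell
#!/usr/bin/env bash
# Hypothetical placeholder commands for one set of jobs.
preprocess()  { echo "pre $1"; }
do_job()      { echo "job $1"; }
postprocess() { echo "post $1"; }

run_jobset() {
  local set=$1; shift
  preprocess "$set"       # step 2: must complete before any job starts
  for job in "$@"; do
    do_job "$job" &       # steps 3-4: launch the set's jobs concurrently
  done
  wait                    # step 5: block until this set's own jobs finish
  postprocess "$set"      # step 6: runs only after all jobs are done
}

run_jobset demo a b c
```

The `job …` lines may appear in any order, but `pre demo` is always first and `post demo` always last, which is exactly the ordering constraint the question asks for.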

I am wondering how I can do this with GNU parallel, or whether GNU parallel is even the right tool for the job.

Upvotes: 2

Views: 1124

Answers (1)

Ole Tange

Reputation: 33740

If we assume you are limited by CPU (and not memory or I/O), then this might work:

do_jobset() {
  jobset=$1
  # pre-processing must finish before any job in the set starts
  preprocess $jobset
  # one job per file in the set; --load 100% keeps the CPUs from overloading
  parallel --load 100% do_job ::: $jobset/*
  # parallel only returns once every job in the set is done
  postprocess $jobset
}
export -f do_jobset
# run the sets themselves in parallel, too
parallel do_jobset ::: *.jobset

If do_job does not use a full CPU from the start, but instead takes 10 seconds to load the data to be processed, add --delay 10 before --load 100%.

The alternative is to run each phase for all sets at once:

parallel preprocess ::: *.jobset
parallel do_job ::: jobsets*/*
parallel postprocess ::: *.jobset

Upvotes: 2
