Reputation: 3425
I have sets of jobs, and all of the jobs within a set can be run in parallel, so I want to parallelize them for better throughput.
This is what I am currently doing: I wrote a Python script using the multiprocessing library that runs the jobs in a set at the same time. After all of the jobs in a set have finished, the next set of jobs (script) is invoked. This is inefficient because each job in a set has a different execution time.
Recently I learned about GNU parallel, and I think it may help improve my script. However, each set of jobs has some pre-processing and post-processing tasks, so it is not possible to run the jobs in an arbitrary order across sets.
In summary, I want to 1) make sure that pre-processing is completed before launching the jobs in a set, and 2) run post-processing only after the jobs in a set have all completed.
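For reference, the per-set ordering described above (pre-process, then all jobs in parallel, then post-process) can be sketched with plain bash background jobs and wait; the preprocess, do_job, and postprocess names here are placeholder stubs, not real commands:

```shell
#!/usr/bin/env bash
# Placeholder implementations -- replace with the real scripts.
preprocess()  { echo "pre  $1"; }
do_job()      { sleep 0.1; echo "job  $1"; }
postprocess() { echo "post $1"; }

run_set() {
  local set=$1; shift
  preprocess "$set"       # 1) pre-processing completes before any job starts
  for job in "$@"; do
    do_job "$job" &       # launch the set's jobs in parallel
  done
  wait                    # block until every background job has exited
  postprocess "$set"      # 2) post-processing runs only after all jobs
}

run_set demo a b c
```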
I am wondering how I can do this with GNU parallel, or whether GNU parallel is even the right tool for the job.
Upvotes: 2
Views: 1124
Reputation: 33740
If we assume you are limited by CPU (and not memory or I/O), then this might work:
do_jobset() {
  jobset=$1
  preprocess $jobset                         # finishes before any job in the set starts
  parallel --load 100% do_job ::: $jobset/*  # one do_job per file in the set
  postprocess $jobset                        # runs only after all the set's jobs exit
}
export -f do_jobset                          # make the function visible to parallel's subshells

parallel do_jobset ::: *.jobset              # run the sets themselves in parallel
If do_job does not use a full CPU from the start, but instead takes, say, 10 seconds to load the data to be processed, add --delay 10 before --load 100% (i.e. parallel --delay 10 --load 100% do_job ::: $jobset/*).
The alternative is to do:
parallel preprocess ::: *.jobset
parallel do_job ::: jobsets*/*
parallel postprocess ::: *.jobset
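With this three-pass form, every set's pre-processing finishes before any job starts, and no post-processing begins until the jobs of all sets are done. The ordering can be illustrated with plain-bash stubs run serially (the set names A, B and the file paths here are made up for the demo):

```shell
#!/usr/bin/env bash
# Stub commands standing in for the real preprocess/do_job/postprocess.
preprocess()  { echo "pre  $1"; }
do_job()      { echo "job  $1"; }
postprocess() { echo "post $1"; }

# Same three phases as the parallel one-liners, shown serially so the
# ordering is visible: ALL pre-processing happens before ANY job, and
# ALL jobs finish before ANY post-processing starts.
run_all() {
  for s in A B;     do preprocess  "$s"; done   # parallel preprocess ::: *.jobset
  for j in A/1 B/1; do do_job      "$j"; done   # parallel do_job ::: jobsets*/*
  for s in A B;     do postprocess "$s"; done   # parallel postprocess ::: *.jobset
}

run_all
```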
Upvotes: 2