Reputation: 6193
I'm currently using Linux to perform the following task:
while read parameter
do
    ./program_a $parameter $parameter.log 2>&1 &
done < parameter_file
Each parameter refers to the name of the file to be processed. Each file contains a different number of lines to process.
For example:
Parameter file contains:
File_A
File_B
File_C
File_A contains 1k lines, File_B contains 10k lines, and File_C contains 1000k lines, which means that with the above script program_a processes 1k, 10k, and 1000k lines simultaneously. The processing time for each task is almost linearly dependent on the number of lines, and each task is independent.
I have a 6-core CPU with 12 threads. Because the processing times vary, after the tasks for File_A and File_B have finished, only one core is still busy, processing the task for File_C. This wastes resources.
I want to split each file into 1k-line pieces and run them simultaneously, but for this example there would then be 1011 tasks running (1k lines per task). I think this would lead to serious context-switching overhead. Maybe I could tune the number of lines per piece to work around this, but I don't think that is a good solution.
My idea is to limit the number of running tasks to 6 at all times, which means always using the maximum number of cores while keeping context switches to a minimum. But I don't know how to modify my script to achieve this goal. Can anyone give me some advice?
Upvotes: 5
Views: 753
Reputation: 33740
I assume program_a can read a single file. Then this should work using GNU Parallel:
parallel --pipepart --block 10k --cat program_a :::: File_A File_B File_C
Adjust the 10k to be the size of your 1000 lines.
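If it helps, one rough way to pick that block size (an illustration, not part of the original commands) is to estimate the average line length from one of the input files and multiply by 1000:

# assumption: we want blocks of roughly 1000 lines;
# estimate bytes per line from File_A and scale up
bytes=$(wc -c < File_A)
lines=$(wc -l < File_A)
block=$(( 1000 * bytes / lines ))
parallel --pipepart --block $block --cat program_a :::: File_A File_B File_C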
It does much the same as @Marcus Rickert's answer, but hides the complexity from you and cleans up temporary files.
If program_a can read from a fifo, this should be faster:
parallel --pipepart --block 10k --fifo program_a :::: File_A File_B File_C
If program_a can read from stdin, it will be shorter:
parallel --pipepart --block 10k program_a :::: File_A File_B File_C
If you really must have exactly 1000 lines per job, try:
cat File_A File_B File_C | parallel --pipe -L1000 -N1 --cat program_a
or:
cat File_A File_B File_C | parallel --pipe -L1000 -N1 program_a
Upvotes: 0
Reputation: 19395
I also think I can use wait to achieve the goal.
Indeed, you can achieve the goal with wait, even though bash's wait unfortunately waits for every process of a specified set, not for any one of them (that is, we can't simply instruct bash to wait for whichever of the running processes finishes first; newer bash versions do offer wait -n for that, see the sketch after the code below). But since

The processing time for each task is almost linearly dependent on the number of lines

and

I want to split each file into 1k-line pieces

we can, to good approximation, say that the process started first also finishes first.
I assume you have already implemented the splitting of the files into 1000-line pieces (I can add that detail if desired) and that their names are stored in the variable $files, in your example File_A000 File_B000 … File_B009 File_C000 … File_C999.
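In case it is useful, here is a minimal sketch of that assumed splitting step (the 3-digit numeric suffixes and the $files variable are assumptions matching the naming above; GNU split is assumed):

# assumed preprocessing: cut every input file into 1000-line pieces
# named File_A000, File_A001, ... and collect the piece names in $files
files=$(
    while read parameter
    do
        split --lines 1000 --numeric-suffixes --suffix-length 3 "$parameter" "$parameter"
        echo "$parameter"[0-9][0-9][0-9]
    done < parameter_file
)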
set --                                   # tasks stored in $1..$6
for file in $files
do  [ $# -lt 6 ] || { wait $1; shift; }  # wait for and remove oldest task if 6
    ./program_a $file $file.log 2>&1 &
    set -- $* $!                         # store new task last
done
wait                                     # wait for the final tasks to finish
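For completeness: if your bash is 4.3 or newer, a variant using wait -n (which does return as soon as any one background job exits) avoids relying on the first-started-finishes-first approximation. A minimal sketch, assuming the same $files:

running=0
for file in $files
do
    # once 6 jobs are running, block until any one of them exits
    [ $running -lt 6 ] || { wait -n; running=$((running - 1)); }
    ./program_a $file $file.log 2>&1 &
    running=$((running + 1))
done
wait    # wait for the remaining jobs to finish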
Upvotes: 0
Reputation: 5333
I wouldn't try to reinvent the load-balancing wheel by splitting the files. Use GNU Parallel to manage the tasks of different sizes. It has plenty of options for parallel execution on one or more machines. If you set it up to allow, say, 4 processes in parallel, it will do exactly that, starting a new task whenever a shorter one completes (see the sketch after the links below).
https://www.gnu.org/software/parallel/
https://www.gnu.org/software/parallel/parallel_tutorial.html
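For instance, limiting GNU Parallel to 6 concurrent jobs (one per core, as the question asks) could look like the sketch below; the {} and {}.log arguments assume program_a takes the input file and a log-file name, as in the question's loop:

# read the file names from parameter_file and keep at most 6 jobs running
parallel -j6 ./program_a {} {}.log :::: parameter_file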
Here's a simple example using cat as a stand-in for ./program_a:
...write a couple of files
% cat > a
a
b
c
% cat > b
a
b
c
d
% cat > files
a
b
... run the tasks
% parallel cat {1} \> {1}.log < files
% more b.log
a
b
c
d
Upvotes: 1
Reputation: 4238
Since you are allowed to split files, I assume that you are also allowed to combine them. In that case you could consider a fast preprocessing step as follows:
#! /bin/bash
# set the number of parallel processes
CPU=6
rm -f complete.out
# combine all files into one
while read parameter
do
    cat $parameter >> complete.out
done < parameter_file
# count the number of lines
lines=$(wc -l complete.out | cut -d " " -f 1)
lines_per_file=$(( $lines / $CPU + 1 ))
# split the big file into equal pieces named xa*
rm -f xa*
split --lines $lines_per_file complete.out
# create a parameter file to mimic the old calling behaviour
rm -f new_parameter_file
for splinter in xa* ; do
    echo $splinter >> new_parameter_file
done
# this is the old call with just 'parameter_file' replaced by 'new_parameter_file'
while read parameter
do
    ./program_a $parameter $parameter.log 2>&1 &
done < new_parameter_file
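As an aside (an alternative, not part of the script above): GNU split can also produce a fixed number of line-aligned pieces directly, which would replace the line-counting and division steps:

# let GNU split cut complete.out into exactly $CPU pieces without breaking lines
split --number=l/$CPU complete.out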
Notes:
The names xa* of the generated files may be different in your setup.
Upvotes: 0