Reputation: 6193
I'm currently using Linux to perform the following task:
while read parameter
do
    ./program_a $parameter $parameter.log 2>&1 &
done < parameter_file
Each parameter refers to the name of the file to be processed. Each file contains a different number of lines to process.
For example:
Parameter file contains:
File_A
File_B
File_C
File_A contains 1k lines, File_B contains 10k lines, and File_C contains 1000k lines, which means that with the above script program_a processes 1k, 10k, and 1000k lines simultaneously. The processing time for each task is almost linearly dependent on the number of lines, and each task is independent.
I have a 6-core CPU with 12 threads. Because the processing times vary, after the tasks for File_A and File_B have finished, only one core is still busy, processing the task for File_C. This wastes resources.
I want to split each file into 1k-line pieces and run them simultaneously, but for this example there would then be 1011 tasks running (1k lines per task). I think this would lead to serious context-switching overhead. Maybe I could tune the number of lines per piece to work around this, but I don't think that is a good solution.
My idea is to limit the number of running tasks to 6 at all times, which means always using the maximum number of cores while keeping context switches to a minimum. But I don't know how to modify my script to achieve this goal. Can anyone give me some advice?
Upvotes: 5
Views: 753
Reputation: 33740
I assume program_a can read a single file. Then this should work using GNU Parallel:
parallel --pipepart --block 10k --cat program_a :::: File_A File_B File_C
Adjust the 10k to be the size of your 1000 lines.
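If it helps, one rough way to pick that block size (an illustration, not part of the original commands) is to estimate the average line length from one of the input files and multiply by 1000:

# assumption: we want blocks of roughly 1000 lines;
# estimate bytes per line from File_A and scale up
bytes=$(wc -c < File_A)
lines=$(wc -l < File_A)
block=$(( 1000 * bytes / lines ))
parallel --pipepart --block $block --cat program_a :::: File_A File_B File_C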
It does much the same as @Marcus Rickert's answer, but hides the complexity from you and cleans up temporary files.
If program_a can read from a fifo, this should be faster:
parallel --pipepart --block 10k --fifo program_a :::: File_A File_B File_C
If program_a can read from stdin, it will be shorter:
parallel --pipepart --block 10k program_a :::: File_A File_B File_C
If you really must have exactly 1000 lines per job, try:
cat File_A File_B File_C | parallel --pipe -L1000 -N1 --cat program_a
or:
cat File_A File_B File_C | parallel --pipe -L1000 -N1 program_a
Upvotes: 0
Reputation: 19395
I also think I can use wait to achieve the goal.
Indeed, you can achieve the goal with wait, even though bash's wait unfortunately waits for every process of a specified set, not for any one of them (that is, we can't simply instruct bash to wait for whichever of the running processes finishes first; newer bash versions do offer wait -n for that, see the sketch after the code below). But since

The processing time for each task is almost linearly dependent on the number of lines

and

I want to split each file into 1k-line pieces

we can, to good approximation, say that the process started first also finishes first.
I assume you have already implemented the splitting of the files into 1000-line pieces (I can add that detail if desired) and that their names are stored in the variable $files, in your example File_A000 File_B000 … File_B009 File_C000 … File_C999.
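In case it is useful, here is a minimal sketch of that assumed splitting step (the 3-digit numeric suffixes and the $files variable are assumptions matching the naming above; GNU split is assumed):

# assumed preprocessing: cut every input file into 1000-line pieces
# named File_A000, File_A001, ... and collect the piece names in $files
files=$(
    while read parameter
    do
        split --lines 1000 --numeric-suffixes --suffix-length 3 "$parameter" "$parameter"
        echo "$parameter"[0-9][0-9][0-9]
    done < parameter_file
)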
set --                                   # tasks stored in $1..$6
for file in $files
do  [ $# -lt 6 ] || { wait $1; shift; }  # wait for and remove oldest task if 6
    ./program_a $file $file.log 2>&1 &
    set -- $* $!                         # store new task last
done
wait                                     # wait for the final tasks to finish
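For completeness: if your bash is 4.3 or newer, a variant using wait -n (which does return as soon as any one background job exits) avoids relying on the first-started-finishes-first approximation. A minimal sketch, assuming the same $files:

running=0
for file in $files
do
    # once 6 jobs are running, block until any one of them exits
    [ $running -lt 6 ] || { wait -n; running=$((running - 1)); }
    ./program_a $file $file.log 2>&1 &
    running=$((running + 1))
done
wait    # wait for the remaining jobs to finish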
Upvotes: 0
Reputation: 5333
I wouldn't try to reinvent the load-balancing wheel by splitting the files. Use GNU Parallel to manage the tasks of different sizes. It has plenty of options for parallel execution on one or more machines. If you set it up to allow, say, 4 processes in parallel, it will do exactly that, starting a new task whenever a shorter one completes (see the sketch after the links below).
https://www.gnu.org/software/parallel/
https://www.gnu.org/software/parallel/parallel_tutorial.html
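For instance, limiting GNU Parallel to 6 concurrent jobs (one per core, as the question asks) could look like the sketch below; the {} and {}.log arguments assume program_a takes the input file and a log-file name, as in the question's loop:

# read the file names from parameter_file and keep at most 6 jobs running
parallel -j6 ./program_a {} {}.log :::: parameter_file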
Here's a simple example using cat as a stand-in for ./program_a:
...write a couple of files
% cat > a
a
b
c
% cat > b
a
b
c
d
% cat > files
a
b
... run the tasks
% parallel cat {1} \> {1}.log < files
% more b.log
a
b
c
d
Upvotes: 1
Reputation: 4238
Since you are allowed to split files, I assume that you are also allowed to combine them. In that case you could consider a fast preprocessing step as follows:
#! /bin/bash
# set the number of parallel processes
CPU=6
rm -f complete.out
# combine all files into one
while read parameter
do
    cat $parameter >> complete.out
done < parameter_file
# count the number of lines
lines=$(wc -l complete.out | cut -d " " -f 1)
lines_per_file=$(( $lines / $CPU + 1 ))
# split the big file into equal pieces named xa*
rm -f xa*
split --lines $lines_per_file complete.out
# create a parameter file to mimic the old calling behaviour
rm -f new_parameter_file
for splinter in xa* ; do
    echo $splinter >> new_parameter_file
done
# this is the old call with just 'parameter_file' replaced by 'new_parameter_file'
while read parameter
do
    ./program_a $parameter $parameter.log 2>&1 &
done < new_parameter_file
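As an aside (an alternative, not part of the script above): GNU split can also produce a fixed number of line-aligned pieces directly, which would replace the line-counting and division steps:

# let GNU split cut complete.out into exactly $CPU pieces without breaking lines
split --number=l/$CPU complete.out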
Notes:
The names xa* of the generated files may be different in your setup.
Upvotes: 0