brainchild

Reputation: 796

multiprocessing for bash loop

I have a non-trivial Bash script taking roughly the following form:

# Initialization

<generate_data> | while IFS= read -r line; do

    # Run tests and filters on line

    if [ "$tests_pass" ]; then
        echo "$filtered_line"
    fi

done | sort <sort_option> | <consume_data>

# Finalization

Compared to the filter, the generator consumes minimal processing resources, and, of course, the sort operation cannot begin until all filtered data is available. As such, the filter, a cascade of several loops and conditionals written natively in Bash, is the processing bottleneck, and the single process running this loop consumes an entire core.

A useful objective would be to distribute this logic across several child processes, each running its own filter loop, each consuming a block of lines from the generator, and each producing an output block that is concatenated into the sort operation. Functionality of this kind is available through tools such as GNU Parallel, but using them requires invoking an external command to run in the pipe.

Is any convenient tool or feature available that makes the operations in the script distributable across multiple processes without disrupting the overall structure of the script? I am not aware of a Bash builtin feature, but one surely would be useful.

Upvotes: 2

Views: 923

Answers (2)

Ole Tange

Reputation: 33748

A useful objective would be to distribute this logic across several child processes, each running its own filter loop, each consuming a block of lines from the generator, and each producing an output block that is concatenated into the sort operation. Functionality of this kind is available through tools such as GNU Parallel, but using them requires invoking an external command to run in the pipe.

You will rarely see Bash scripts that do not invoke external commands. You even use sort in your pipe, and sort is an external command.

Is any convenient tool ...

Without your definition of 'convenient tool', that is impossible to answer. I would personally find parallel --pipe cmd convenient, but maybe it does not fit your definition.

... or feature available that makes the operations in the script distributable across multiple processes without disrupting the overall structure of the script? I am not aware of a Bash builtin feature, but one surely would be useful.

There is no Bash builtin. That is the primary reason why GNU Parallel has the --pipe option.

Using | parallel --pipe myfilter | seems to fit quite well with the overall structure of the script.
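
For concreteness, here is a minimal sketch of the resulting pipeline. The --block size and job count are illustrative, not prescriptive; myfilter stands in for the filter logic (it must be something the spawned shells can run, e.g. an exported function or a separate script); and output order across blocks does not matter here, since everything is re-sorted downstream anyway:

# Split stdin into ~1 MB blocks of whole lines and run one
# myfilter job per block, with up to one job per CPU core.
<generate_data> |
    parallel --pipe --block 1M -j "$(nproc)" myfilter |
    sort <sort_option> | <consume_data>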

Upvotes: 0

Ole Tange

Reputation: 33748

The issue with invoking an external command is the loss of code manageability that comes with moving the filter logic out into a separate command that can be called independently.

If that is the reason for not using GNU Parallel, it sounds as if you are not aware of parallel --embed.

--embed is made exactly because people need to have GNU Parallel in the same file as the rest of the code.
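
As a minimal sketch of that workflow (the file name myscript.sh is arbitrary), the template is generated once, and the rest of the script is then written into it:

# Emit a shell script with a copy of GNU Parallel embedded in it;
# the filter function and pipeline below are then added to that file.
parallel --embed > myscript.sh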

[output from parallel --embed]

myfilter() {
    while IFS= read -r line; do
        # Run tests and filters on line
        if [ "$tests_pass" ]; then
            echo "$filtered_line"
        fi
    done
}
export -f myfilter

<generate_data> | parallel --pipe myfilter | sort <sort_option> | <consume_data>

The resulting script will run even if GNU Parallel is not installed.

Upvotes: 3
