Reputation: 796
I have a non-trivial Bash script taking roughly the following form:
# Initialization
<generate_data> | while IFS= read -r line; do
    # Run tests and filters on line
    if [ "$tests_pass" ]; then
        echo "$filtered_line"
    fi
done | sort <sort_option> | <consume_data>
# Finalization
Compared to the filter, the generator consumes minimal processing resources, and, of course, the sort operation cannot begin until all filtered data is available. As such, the filter, a cascade of several loops and conditionals written natively in Bash, is the processing bottleneck, and the single process running this loop consumes an entire core.
A useful objective would be to distribute this logic across several child processes, each running its own filter loop, each consuming blocks of lines from the generator, and each producing output blocks that are concatenated into the sort operation. Functionality of this kind is available through tools such as GNU Parallel, but using them requires invoking an external command to run in the pipe.
Is there any convenient tool or feature that would make the operations in this script distributable across multiple processes without disrupting its overall structure? I am not aware of a Bash builtin for this, but one surely would be useful.
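For concreteness, a hand-rolled sketch of the kind of fan-out I have in mind appears below; the worker count and the per-line round-robin dispatch are purely illustrative, and myfilter stands for the filter loop above wrapped in a function:
# Illustrative only: N worker processes each run the filter loop,
# fed round-robin over FIFOs; their outputs are concatenated into sort.
N=4
dir=$(mktemp -d)
for ((i = 0; i < N; i++)); do
    mkfifo "$dir/in$i"
    myfilter < "$dir/in$i" > "$dir/out$i" &   # each worker blocks until its FIFO gains a writer
done
fds=()
for ((i = 0; i < N; i++)); do
    exec {fd}> "$dir/in$i"                    # hold one write descriptor open per worker
    fds+=("$fd")
done
i=0
<generate_data> | while IFS= read -r line; do
    printf '%s\n' "$line" >&"${fds[i]}"       # deal lines out round-robin
    i=$(( (i + 1) % N ))
done
for fd in "${fds[@]}"; do
    exec {fd}>&-                              # close write ends so the workers see EOF
done
wait
cat "$dir"/out* | sort <sort_option> | <consume_data>
rm -rf "$dir"
Maintaining plumbing like this by hand is exactly what I hope to avoid.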
Upvotes: 2
Views: 923
Reputation: 33748
A useful objective would be to distribute this logic across several child processes, each running its own filter loop, each consuming blocks of lines from the generator, and each producing output blocks that are concatenated into the sort operation. Functionality of this kind is available through tools such as GNU Parallel, but using them requires invoking an external command to run in the pipe.
You will rarely see Bash scripts that do not invoke external commands. You even use sort in your pipe, and sort is an external command.
Is there any convenient tool ...
Without knowing your definition of 'convenient tool', that is impossible to answer. I would personally find parallel --pipe cmd convenient, but maybe it does not fit your definition.
... or feature that would make the operations in this script distributable across multiple processes without disrupting its overall structure? I am not aware of a Bash builtin for this, but one surely would be useful.
There is no Bash builtin; that is the primary reason why GNU Parallel has the --pipe option.
Using | parallel --pipe myfilter | seems to fit quite well with the overall structure of the script.
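For instance, a minimal sketch (the block size is illustrative, and myfilter is assumed to be the filter packaged as a command or exported function):
# Split stdin into blocks and feed each block to a myfilter job;
# parallel runs one job per CPU core by default. --block 1M is an example value.
<generate_data> |
    parallel --pipe --block 1M myfilter |
    sort <sort_option> | <consume_data>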
Upvotes: 0
Reputation: 33748
The issue with invoking an external command is the loss of code manageability: the filter logic would have to be moved out into a separate command that can be called independently.
If that is the reason for not using GNU Parallel, it sounds as if you are not aware of parallel --embed. --embed exists precisely because people need to keep GNU Parallel in the same file as the rest of their code.
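You generate the script once and then add your own code at the end (myscript.sh is just an example name):
# Emit a new script with the GNU Parallel source embedded at the top;
# everything after the embedded part is yours to edit.
parallel --embed > myscript.sh
# Then append your own functions and pipeline, as in the example below.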
[output from parallel --embed]
myfilter() {
    while IFS= read -r line; do
        # Run tests and filters on line
        if [ "$tests_pass" ]; then
            echo "$filtered_line"
        fi
    done
}
export -f myfilter

<generate_data> | parallel --pipe myfilter | sort <sort_option> | <consume_data>
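Note that export -f myfilter is what makes the shell function visible to the child shells that parallel spawns; that is what lets the filter logic stay in the same file instead of being split out into a separate command.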
The resulting script will run even if GNU Parallel is not installed.
Upvotes: 3