In bash, is it generally better to use process substitution or pipelines

Question

When the output of a single command is consumed by exactly one other command, is it better to use | (pipelines) or <() (process substitution)?

"Better" is, of course, subjective. For my specific use case, performance is the primary driver, but I am also interested in robustness.

I already know about the benefits of while read; do ...; done < <(cmd) and have switched over to it.

I have several var=$(cmd1|cmd2) instances that I suspect might be better replaced as var=$(cmd2 < <(cmd1)).

I would like to know what specific benefits the latter case brings over the former.

that other guy · Accepted Answer

tl;dr: Use pipes, unless you have a convincing reason not to.

Piping and redirecting stdin from a process substitution is essentially the same thing: both will result in two processes connected by an anonymous pipe.
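As a quick sanity check of that equivalence, here is a minimal sketch using the two forms from the question. The sample words and the variable names via_pipe and via_procsub are made up for illustration; printf and sort stand in for cmd1 and cmd2.

```shell
#!/bin/bash
# Hypothetical sample data; any producer/consumer pair would do.

# Form 1: producer piped into consumer.
via_pipe=$(printf '%s\n' pear apple fig | sort)

# Form 2: consumer reads stdin from a process substitution.
via_procsub=$(sort < <(printf '%s\n' pear apple fig))

# Both forms connect the two commands with an anonymous pipe,
# so the captured output should be identical.
if [[ $via_pipe == "$via_procsub" ]]; then
  echo "same output"
else
  echo "different output"
fi
```

The result is the same either way; the differences are in how bash forks and waits.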

There are three practical differences:

1. Bash defaults to creating a fork for every stage in a pipeline.

Which is why you started looking into this in the first place:

#!/bin/bash
cat "$1" | while IFS= read -r line; do last=$line; done
echo "Last line of $1 is $last"

This script won't work by default with a pipeline because, unlike ksh and zsh, bash forks a subshell for each stage, so the value assigned inside the loop is lost when that subshell exits.

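If you set shopt -s lastpipe in bash 4.2+, bash mimics the ksh and zsh behavior and the script works just fine. Note that lastpipe only takes effect when job control is off, which is the default in scripts but not in interactive shells.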
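For comparison, here is a rough sketch of both ways to keep the loop in the current shell. The input data and the variable names last_pipe and last_procsub are made up for illustration. The loop copies each line into a separate variable because the final read that hits end-of-file leaves its own variable empty.

```shell
#!/bin/bash
input=$'one\ntwo\nthree'   # hypothetical sample data

# Option 1: keep the pipeline, but run its last stage in the current shell.
# Needs bash 4.2+ and job control off (the default in a script).
shopt -s lastpipe
printf '%s\n' "$input" | while IFS= read -r line; do last_pipe=$line; done

# Option 2: feed the loop from a process substitution, so the loop
# is not part of a pipeline at all.
while IFS= read -r line; do last_procsub=$line; done < <(printf '%s\n' "$input")

echo "lastpipe: $last_pipe, procsub: $last_procsub"
```

Either option leaves the variable set after the loop; the process substitution form also works in an interactive shell, where lastpipe has no effect.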

2. Bash does not wait for process substitutions to finish.

POSIX only requires a shell to wait for the last process in a pipeline, but most shells including bash will wait for all of them.

This makes a notable difference when you have a slow producer, like in a /dev/random password generator:

tr -cd 'a-zA-Z0-9' < /dev/random | head -c 10     # Slow?
head -c 10 < <(tr -cd 'a-zA-Z0-9' < /dev/random)  # Fast?

The first example will not benchmark favorably. Once head is satisfied and exits, tr will wait around for its next write() call to discover that the pipe is broken.

Since bash waits for both head and tr to finish, it will appear slower.

In the procsub version, bash only waits for head, and lets tr finish in the background.
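The waiting behavior can be seen in isolation with a small sketch. This is not the password generator above: sleep 2 is just a stand-in for a slow producer and true for a consumer that exits immediately, and the variable names are made up. SECONDS only has one-second resolution, so treat the numbers as approximate.

```shell
#!/bin/bash
# 'sleep 2' stands in for a slow producer, 'true' for a consumer that exits at once.

SECONDS=0
sleep 2 | true               # bash waits for every stage, including sleep
pipe_elapsed=$SECONDS

SECONDS=0
true < <(sleep 2)            # bash waits only for true; sleep finishes in the background
procsub_elapsed=$SECONDS

echo "pipeline: ${pipe_elapsed}s, process substitution: ${procsub_elapsed}s"
```

The pipeline should take about as long as the producer, while the process substitution version returns almost immediately. The producer is still running in both cases; the second form just doesn't make you wait for it.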

3. Bash does not currently optimize away forks for single simple commands in process substitutions.

If you invoke an external command like sleep 1, then the Unix process model requires that bash forks and executes the command.

Since forks are expensive, bash optimizes the cases that it can. For example, the command:

bash -c 'sleep 1'

Would naively incur two forks: one to run bash, and one to run sleep. However, bash can optimize it because there's no need for bash to stay around after sleep finishes, so it can instead just replace itself with sleep (execve with no fork). This is very similar to tail call optimization.

( sleep 1 ) is similarly optimized, but <( sleep 1 ) is not. The source code does not offer a particular reason why, so it may just not have come up.

$ strace -f bash -c '/bin/true | /bin/true'     2>&1 | grep -c clone
2
$ strace -f bash -c '/bin/true < <(/bin/true)'  2>&1 | grep -c clone
3

Given the above you can create a benchmark favoring whichever position you want, but since the number of forks is generally much more relevant, pipes would be the best default.

And obviously, it doesn't hurt that pipes are the POSIX-standard, canonical way of connecting stdin/stdout of two processes, and work equally well on all platforms.
