Reputation: 1699
I have a script that wants to run several programs / pipelines over a very large file. Example:
grep "ABC" file > file.filt
md5sum file > file.md5
The kernel will try to cache the file in RAM, so if it is read again soon the data may come from RAM instead of disk. However, the files are large and the programs run at wildly different speeds, so this is unlikely to be effective. To minimise IO I want to read the file only once.
I know of 2 ways to duplicate the data using tee and moreutils' pee:
<file tee >(md5sum > file.md5) | grep "ABC" > file.filt
<file pee 'md5sum > file.md5' 'grep "ABC" > file.filt'
Is there another 'best' way? Which method will make the fewest copies? Does it make a difference which program is >() or |-ed to? Will any of these approaches attempt to buffer data in RAM if one program is too slow? How do they scale to many reader programs?
Upvotes: 3
Views: 550
Reputation: 1699
tee (command) opens each file using fopen, but sets _IONBF (unbuffered) on each. It reads from stdin, and fwrites to each FILE*.
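As a rough illustration only (not the coreutils source, just the pattern described above): fopen each output, switch it to unbuffered with setvbuf, then read from stdin and fwrite each chunk to stdout and to every stream.

#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    FILE *out[16];
    int nout = argc - 1 < 16 ? argc - 1 : 16;
    char buf[65536];
    ssize_t n;

    setvbuf(stdout, NULL, _IONBF, 0);           /* keep stdout unbuffered too */
    for (int i = 0; i < nout; i++) {
        out[i] = fopen(argv[i + 1], "w");
        if (!out[i])
            return 1;
        setvbuf(out[i], NULL, _IONBF, 0);       /* _IONBF: no stdio buffer */
    }

    while ((n = read(STDIN_FILENO, buf, sizeof buf)) > 0) {
        fwrite(buf, 1, (size_t)n, stdout);      /* stdout first */
        for (int i = 0; i < nout; i++)
            fwrite(buf, 1, (size_t)n, out[i]);  /* then each file, in order */
    }
    return 0;
}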
pee (command) popens each command, sets each to unbuffered, reads from stdin, and fwrites to each FILE*.
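Again purely as a sketch (not the moreutils source), the same fan-out with popen instead of fopen looks roughly like this; it would be invoked as e.g. ./fanout 'md5sum > file.md5' 'grep "ABC" > file.filt' < file.

#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    FILE *cmd[16];
    int ncmd = argc - 1 < 16 ? argc - 1 : 16;
    char buf[65536];
    ssize_t n;

    for (int i = 0; i < ncmd; i++) {
        cmd[i] = popen(argv[i + 1], "w");       /* runs "sh -c <arg>" on a pipe */
        if (!cmd[i])
            return 1;
        setvbuf(cmd[i], NULL, _IONBF, 0);       /* unbuffered, like pee */
    }

    while ((n = read(STDIN_FILENO, buf, sizeof buf)) > 0)
        for (int i = 0; i < ncmd; i++)
            fwrite(buf, 1, (size_t)n, cmd[i]);  /* blocks when a pipe is full */

    for (int i = 0; i < ncmd; i++)
        pclose(cmd[i]);
    return 0;
}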
popen uses pipe(2), which has a capacity of 65536 bytes. Writes to a full buffer will block. pee also uses /bin/sh to interpret the command, but I think that will not add any buffering/copying.
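The 65536-byte figure can be checked with the Linux-specific F_GETPIPE_SZ fcntl; on a default kernel this prints 65536.

#define _GNU_SOURCE
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    int fds[2];
    if (pipe(fds) != 0)
        return 1;
    /* query the kernel buffer size of a freshly created pipe */
    printf("pipe capacity: %d bytes\n", fcntl(fds[0], F_GETPIPE_SZ));
    return 0;
}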
mkfifo (command) uses mkfifo (libc), which uses pipes underneath; opening the file/pipe blocks until the other end is opened.
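A small way to see that blocking-open behaviour from C (the file name is just an example): with O_NONBLOCK, opening the FIFO for writing fails with ENXIO until some process has it open for reading; a plain blocking open() would simply wait at that point instead.

#include <stdio.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    const char *path = "demo.fifo";
    if (mkfifo(path, 0600) != 0 && errno != EEXIST)
        return 1;

    int fd = open(path, O_WRONLY | O_NONBLOCK);
    if (fd < 0)
        printf("no reader yet: %s\n", strerror(errno));  /* ENXIO */
    else
        close(fd);

    unlink(path);
    return 0;
}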
bash's >() syntax (subst.c:5712) uses either pipe or mkfifo: pipe if /dev/fd is supported. It does not use the C fopen calls, so it does not set any buffering.
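The pipe-plus-/dev/fd variant amounts to roughly the following (a hedged sketch, not bash's actual code; md5sum is just the example from the question): the shell makes a pipe, starts the inner command on one end, and substitutes a /dev/fd/N path naming the other end into the outer command's arguments.

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int fds[2];
    char path[32];

    if (pipe(fds) != 0)
        return 1;

    if (fork() == 0) {                 /* inner command: reads from the pipe */
        close(fds[1]);
        dup2(fds[0], STDIN_FILENO);
        execlp("md5sum", "md5sum", (char *)NULL);
        _exit(127);
    }

    close(fds[0]);
    snprintf(path, sizeof path, "/dev/fd/%d", fds[1]);
    printf("outer command would be given: %s\n", path);
    /* the outer command simply open()s that path; no stdio buffering
     * is set up on the shell's side */
    close(fds[1]);
    wait(NULL);
    return 0;
}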
So all three variants (pee, tee >(), mkfifo ...) should end up with identical behaviour: reading from stdin and writing to pipes without buffering. The data is duplicated at each read (from kernel to user space) and again at each write (user space back to kernel); I think tee's fwrites will not cause an extra layer of copying (as there is no stdio buffer). Memory usage could increase to a maximum of 65536 * num_readers + 1 * read_size (if no one is reading). tee writes to stdout first, then to each file/pipe in order.
Given that pee just works around other shells' (fish!) lack of an equivalent of the >() operator, there seems to be no need for it with bash. I prefer tee when you have bash, but pee is nice when you don't. Of course, pee does not replace bash's <(). Manually mkfifoing and redirecting is tricky and unlikely to handle errors nicely.
pee could probably be changed to use the tee(2) system call instead of fwrite. I think this would cause the input to be read at the speed of the fastest reader, and potentially fill up the kernel pipe buffers.
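For what that change would look like, here is a hedged sketch using the Linux-only tee(2) and splice(2) calls, limited to two readers for brevity. It assumes stdin is a pipe (e.g. cat file | ...) and that fds 3 and 4 are already pipes to the two reader commands; that setup is an assumption of the sketch, not how pee is actually structured.

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    /* fds 3 and 4 are assumed to be pipes to the reader commands,
     * e.g. opened by the shell as 3> >(md5sum > file.md5) 4> >(grep "ABC" > file.filt) */
    int out_a = 3, out_b = 4;

    for (;;) {
        /* duplicate whatever sits in stdin's pipe buffer into out_a,
         * entirely inside the kernel, without consuming it */
        ssize_t n = tee(STDIN_FILENO, out_a, 65536, 0);
        if (n <= 0)
            break;                      /* 0 = writer closed, <0 = error */

        /* now move (and consume) exactly those n bytes into out_b */
        for (ssize_t left = n; left > 0; ) {
            ssize_t m = splice(STDIN_FILENO, NULL, out_b, NULL,
                               (size_t)left, 0);
            if (m <= 0)
                return 1;
            left -= m;
        }
    }
    return 0;
}

Because the payload never enters user space, the only per-chunk cost is the kernel-side duplication into each output pipe, which is what lets input be consumed as fast as the fastest reader drains its pipe.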
Upvotes: 1
Reputation: 5762
AFAIK, there is no "best" way to achieve this. But I can give you another approach: more verbose, not a one-liner, but maybe clearer because each command is written on its own line. Use named pipes:
mkfifo tmp1 tmp2
tee tmp1 > tmp2 < file &
cat tmp1 | md5sum > file.md5 &
cat tmp2 | grep "ABC" > file.filt &
wait
rm tmp1 tmp2
tee the input file to the named pipes (tee copies its input to standard output, so the last named pipe has to be a redirection), and let it run in the background. The drawback of this approach, when the programs' speeds vary greatly, is that all of them read the file at the same pace (the limit being the pipe buffer size: once it is full for one of the pipes, the others have to wait too), so if one of them is resource-hungry (for example memory-hungry), those resources stay in use for the whole lifespan of all the processes.
Upvotes: 0