Reputation: 1699
I have a script that wants to run several programs / pipelines over a very large file. Example:
grep "ABC" file > file.filt
md5sum file > file.md5
The kernel will try to cache the file in RAM, so if it is read again soon the data may come from RAM instead of disk. However, the files are large and the programs run at wildly different speeds, so this is unlikely to be effective. To minimise IO I want to read the file only once.
I know of 2 ways to duplicate the data using tee and moreutils' pee:
<file tee >(md5sum > file.md5) | grep "ABC" > file.filt
<file pee 'md5sum > file.md5' 'grep "ABC" > file.filt'
Is there another 'best' way? Which method will make the fewest copies? Does it make a difference which program is >() or |-ed to? Will any of these approaches attempt to buffer data in RAM if one program is too slow? How do they scale to many reader programs?
Upvotes: 3
Views: 550
Reputation: 1699
tee (command) opens each file using fopen, but sets _IONBF (unbuffered) on each. It reads from stdin, and fwrites to each FILE*.
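As a rough illustration only (not the coreutils source, just the pattern described above): fopen each output, switch it to unbuffered with setvbuf, then read from stdin and fwrite each chunk to stdout and to every stream.

#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    FILE *out[16];
    int nout = argc - 1 < 16 ? argc - 1 : 16;
    char buf[65536];
    ssize_t n;

    setvbuf(stdout, NULL, _IONBF, 0);           /* keep stdout unbuffered too */
    for (int i = 0; i < nout; i++) {
        out[i] = fopen(argv[i + 1], "w");
        if (!out[i])
            return 1;
        setvbuf(out[i], NULL, _IONBF, 0);       /* _IONBF: no stdio buffer */
    }

    while ((n = read(STDIN_FILENO, buf, sizeof buf)) > 0) {
        fwrite(buf, 1, (size_t)n, stdout);      /* stdout first */
        for (int i = 0; i < nout; i++)
            fwrite(buf, 1, (size_t)n, out[i]);  /* then each file, in order */
    }
    return 0;
}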
pee (command) popens each command, sets each to unbuffered, reads from stdin, and fwrites to each FILE*.
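Again purely as a sketch (not the moreutils source), the same fan-out with popen instead of fopen looks roughly like this; it would be invoked as e.g. ./fanout 'md5sum > file.md5' 'grep "ABC" > file.filt' < file.

#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    FILE *cmd[16];
    int ncmd = argc - 1 < 16 ? argc - 1 : 16;
    char buf[65536];
    ssize_t n;

    for (int i = 0; i < ncmd; i++) {
        cmd[i] = popen(argv[i + 1], "w");       /* runs "sh -c <arg>" on a pipe */
        if (!cmd[i])
            return 1;
        setvbuf(cmd[i], NULL, _IONBF, 0);       /* unbuffered, like pee */
    }

    while ((n = read(STDIN_FILENO, buf, sizeof buf)) > 0)
        for (int i = 0; i < ncmd; i++)
            fwrite(buf, 1, (size_t)n, cmd[i]);  /* blocks when a pipe is full */

    for (int i = 0; i < ncmd; i++)
        pclose(cmd[i]);
    return 0;
}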
popen uses pipe(2), which has a capacity of 65536 bytes. Writes to a full buffer will block. pee also uses /bin/sh to interpret the command, but I think that will not add any buffering/copying.
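The 65536-byte figure can be checked with the Linux-specific F_GETPIPE_SZ fcntl; on a default kernel this prints 65536.

#define _GNU_SOURCE
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    int fds[2];
    if (pipe(fds) != 0)
        return 1;
    /* query the kernel buffer size of a freshly created pipe */
    printf("pipe capacity: %d bytes\n", fcntl(fds[0], F_GETPIPE_SZ));
    return 0;
}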
mkfifo (command) uses mkfifo (libc), which uses pipes underneath; opening the file/pipe blocks until the other end is opened.
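A small way to see that blocking-open behaviour from C (the file name is just an example): with O_NONBLOCK, opening the FIFO for writing fails with ENXIO until some process has it open for reading; a plain blocking open() would simply wait at that point instead.

#include <stdio.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    const char *path = "demo.fifo";
    if (mkfifo(path, 0600) != 0 && errno != EEXIST)
        return 1;

    int fd = open(path, O_WRONLY | O_NONBLOCK);
    if (fd < 0)
        printf("no reader yet: %s\n", strerror(errno));  /* ENXIO */
    else
        close(fd);

    unlink(path);
    return 0;
}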
bash's >() syntax (subst.c:5712) uses either pipe or mkfifo: pipe if /dev/fd is supported. It does not use the C fopen calls, so it does not set any buffering.
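The pipe-plus-/dev/fd variant amounts to roughly the following (a hedged sketch, not bash's actual code; md5sum is just the example from the question): the shell makes a pipe, starts the inner command on one end, and substitutes a /dev/fd/N path naming the other end into the outer command's arguments.

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int fds[2];
    char path[32];

    if (pipe(fds) != 0)
        return 1;

    if (fork() == 0) {                 /* inner command: reads from the pipe */
        close(fds[1]);
        dup2(fds[0], STDIN_FILENO);
        execlp("md5sum", "md5sum", (char *)NULL);
        _exit(127);
    }

    close(fds[0]);
    snprintf(path, sizeof path, "/dev/fd/%d", fds[1]);
    printf("outer command would be given: %s\n", path);
    /* the outer command simply open()s that path; no stdio buffering
     * is set up on the shell's side */
    close(fds[1]);
    wait(NULL);
    return 0;
}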
So all three variants (pee, tee >(), mkfifo ...) should end up with identical behaviour: reading from stdin and writing to pipes without buffering. The data is duplicated at each read (from kernel to user space) and again at each write (user space back to kernel); I think tee's fwrites will not cause an extra layer of copying (as there is no stdio buffer). Memory usage could increase to a maximum of 65536 * num_readers + 1 * read_size (if no one is reading). tee writes to stdout first, then to each file/pipe in order.
Given that pee just works around other shells' (fish!) lack of an equivalent of the >() operator, there seems to be no need for it with bash. I prefer tee when you have bash, but pee is nice when you don't. Of course, pee does not replace bash's <(). Manually mkfifoing and redirecting is tricky and unlikely to handle errors nicely.
pee could probably be changed to use the tee(2) system call instead of fwrite. I think this would cause the input to be read at the speed of the fastest reader, and potentially fill up the kernel pipe buffers.
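For what that change would look like, here is a hedged sketch using the Linux-only tee(2) and splice(2) calls, limited to two readers for brevity. It assumes stdin is a pipe (e.g. cat file | ...) and that fds 3 and 4 are already pipes to the two reader commands; that setup is an assumption of the sketch, not how pee is actually structured.

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    /* fds 3 and 4 are assumed to be pipes to the reader commands,
     * e.g. opened by the shell as 3> >(md5sum > file.md5) 4> >(grep "ABC" > file.filt) */
    int out_a = 3, out_b = 4;

    for (;;) {
        /* duplicate whatever sits in stdin's pipe buffer into out_a,
         * entirely inside the kernel, without consuming it */
        ssize_t n = tee(STDIN_FILENO, out_a, 65536, 0);
        if (n <= 0)
            break;                      /* 0 = writer closed, <0 = error */

        /* now move (and consume) exactly those n bytes into out_b */
        for (ssize_t left = n; left > 0; ) {
            ssize_t m = splice(STDIN_FILENO, NULL, out_b, NULL,
                               (size_t)left, 0);
            if (m <= 0)
                return 1;
            left -= m;
        }
    }
    return 0;
}

Because the payload never enters user space, the only per-chunk cost is the kernel-side duplication into each output pipe, which is what lets input be consumed as fast as the fastest reader drains its pipe.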
Upvotes: 1
Reputation: 5762
AFAIK, there is no "best" way to achieve this. But I can give you another approach: more verbose, not a one-liner, but maybe clearer because each command is written on its own line. Use named pipes:
mkfifo tmp1 tmp2
tee tmp1 > tmp2 < file &
cat tmp1 | md5sum > file.md5 &
cat tmp2 | grep "ABC" > file.filt &
wait
rm tmp1 tmp2
tee the input file to the named pipes (tee copies its input to standard output, so the last named pipe has to be a redirection), and let it run in the background. The drawback of this approach, when the programs' speeds vary greatly, is that all of them read the file at the same pace (the limit being the pipe buffer size: once it is full for one of the pipes, the others have to wait too), so if one of them is resource-hungry (for example memory-hungry), those resources stay in use for the whole lifespan of all the processes.
Upvotes: 0