Reputation: 793
I have a gzipped file that I've split into 3 separate files: xaa, xab, xac. I make a fifo
mkfifo p1
and reassemble the files by reading from it, also calculating a checksum and unzipping the file in a pipe:
cat p1 p1 p1 | tee >(sha1sum > sha1sum_new.txt) | gunzip > output_file.txt
This works just fine if I feed the pipe from another terminal with
cat xaa > p1
cat xab > p1
cat xac > p1
but if I feed the pipe with a single line,
cat xaa > p1; cat xab > p1; cat xac > p1
the receiving pipeline hangs, no checksum is produced, and although an output file is produced, it is truncated - but by an amount smaller than the final file size.
Why is the behavior in the second case different from the first?
Upvotes: 1
Views: 58
Reputation: 21213
Interesting question. As the other answer mentions, you have a race condition - and I am pretty sure of that. In fact, you have a race condition in both cases, but in the former you're just lucky it doesn't happen because maybe your files are small and can be read before you enter the next command line. Allow me to explain.
So, a little bit of background first:
cat opens each file you feed it as an argument sequentially, prints it to the output, and then closes the file and moves on to the next file. The exact details of whether cat opens each file sequentially or opens them all first and then writes each file sequentially may vary, but it's not relevant for the discussion. In both cases, you'll have a race condition.
The open(2) syscall will block on a FIFO / pipe until the other end is opened. So for example, if process pid1 opens the FIFO for reading, open(2) will block until, say, pid2 opens the FIFO for writing. In other words, opening a FIFO that has no active readers or writers implicitly synchronizes both processes and guarantees that a process will not read from a pipe that has no writer yet, or that a writer will not write to a pipe that has no reader yet. But as we will see, this will be problematic.
What's really happening
When you do this:
cat xaa > p1
cat xab > p1
cat xac > p1
Things are really slow, because humans are slow. After you enter the first line, cat opens p1 for writing. The other cat is blocked on opening it for reading (or maybe not yet, but let's assume it is). Once both cat processes open p1 - one for writing, the other for reading - data starts to flow.
And then, before you even have the chance to enter the next command line (cat xab >p1), the whole file flows through the pipe and everyone is happy - the cat writer finishes writing the file and closes p1, and the cat reader sees an end of file on the pipe and calls close(2). The cat reader moves on to the next file (which is p1 again), opens it, and blocks because no active writers have opened the FIFO yet.
Then, you, slow human, enter the next command line, which causes another cat writer process to open the FIFO, which unblocks the other cat that is waiting to open for reading, and everything happens again. And then again for the third command line.
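The blocking open that paces the slow case can be observed directly. This is only a minimal sketch: the temporary directory, out.txt and the one-second sleep are my own choices for illustration, and kill -0 merely checks that the reader process still exists, which I take as a sign that it is still waiting in open(2).

```shell
#!/bin/sh
# Sketch: a reader opening a FIFO waits until some writer opens it too.
dir=$(mktemp -d)
mkfifo "$dir/p1"

# The redirection (and therefore the blocking open) happens in the
# background child, so the script itself keeps running.
cat < "$dir/p1" > "$dir/out.txt" &
reader=$!

sleep 1
# No writer has appeared yet, so the reader should still be around.
if kill -0 "$reader" 2>/dev/null; then
    state_before="blocked"
else
    state_before="gone"
fi

# Opening p1 for writing releases the reader; closing it delivers EOF.
echo "hello" > "$dir/p1"
wait "$reader"

received=$(cat "$dir/out.txt")
rm -rf "$dir"

echo "reader before any writer: $state_before"
echo "reader received: $received"
```

The reader only gets going once the echo opens p1 for writing, which is the same hand-off that happens each time you type the next cat command by hand.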
When you put everything in one line in the shell, things happen way too fast.
Let's differentiate the 3 cat invocations. Call them cat1, cat2 and cat3:
cat1 xaa > p1; cat2 xab > p1; cat3 xac > p1
The shell executes each command sequentially, waiting for the previous command to finish before moving to the next one.
However, it might just be the case that cat1 finishes writing everything to p1 and exits, and the shell moves on to cat2, which opens the FIFO and starts writing the contents of xab, all before the cat reader has had the chance to finish reading what cat1 wrote. Now the cat reader "thinks" it's still reading from the first file (the first p1), but at some point it starts reading the data that cat2 started pushing into the pipe (as if it were in the first p1). It has no way of knowing that the first "copy" of the data is over if cat2 is faster and opens the FIFO before the cat reader finishes reading what cat1 wrote.
Yes, subtle, but it's exactly what is happening.
Then, of course, input eventually comes to an end, and the cat reader will think that the first p1 is done and move on to the next p1, opening it and waiting for the next writer to open it. But there will never be a next writer! It blocks forever, and the whole pipeline is stalled forever.
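The stall can be reproduced without relying on timing. The sketch below forces the overlap by keeping an extra write descriptor (fd 3, my own choice) open while two chunks are written, so the reader never sees an end of file between them. In the real one-liner the overlap depends on scheduling. Here it is deliberate. The file names and the sleep are assumptions for the demo only.

```shell
#!/bin/sh
# Sketch: when writers overlap, the reader cannot tell where one "file" ends.
dir=$(mktemp -d)
mkfifo "$dir/p1"

# The reader expects two separate files, like "cat p1 p1 p1" in the question.
cat "$dir/p1" "$dir/p1" > "$dir/out.txt" &
reader=$!

# fd 3 keeps a writer attached for the whole time both chunks are sent.
exec 3> "$dir/p1"
printf 'first\n' > "$dir/p1"
printf 'second\n' > "$dir/p1"
exec 3>&-

sleep 1
# Both chunks went through the FIRST open of p1. The reader is now trying
# to open p1 a second time and waits for a writer that is not coming.
if kill -0 "$reader" 2>/dev/null; then
    reader_state="still waiting"
else
    reader_state="finished"
fi
received=$(cat "$dir/out.txt")

# Release the stuck reader so the demo can exit: open and close a writer.
: > "$dir/p1"
wait "$reader"
rm -rf "$dir"

echo "reader is $reader_state"
echo "$received"
```

All the data arrives, but the reader has used up only one of its p1 arguments, so it stays parked on the next open. With three p1 arguments and a real gunzip downstream, that is the hang from the question.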
How to fix it
The solution in the other answer solves the problem. You mentioned in the comments that it might not be enough for you because you don't control when and how a new writer opens and uses the pipe.
So I suggest this instead:
1. Run cat from standard input to p1 in the background: cat >p1 &. This keeps a writer attached to the FIFO. When you're done, kill the background job.
2. Open the FIFO for reading only once, either with cat p1 | tee >(sha1sum ...) or using the method proposed in the other answer (tee >(...) <p1). After all, opening a FIFO once should be enough no matter how complex your system is; FIFOs by nature always give you the data in a first in first out fashion.
3. Keep the background cat writer running as long as you know that there's a chance of new files arriving / new writers opening the FIFO and using it. Don't forget to terminate the background job when you know that input is over.
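Here is a rough, script-friendly sketch of the same idea. Note the substitutions: instead of a background cat >p1 & it holds the write end open with exec 3>p1 (a background cat in a non-interactive script would read from /dev/null and exit right away), and a plain cat stands in for the tee | gunzip pipeline. The chunk contents, fd 3 and out.txt are made up for the demo.

```shell
#!/bin/sh
# Sketch: one long-lived writer descriptor, one reader that opens p1 once.
dir=$(mktemp -d)
mkfifo "$dir/p1"
printf 'part one\n'   > "$dir/xaa"
printf 'part two\n'   > "$dir/xab"
printf 'part three\n' > "$dir/xac"

# The reader opens the FIFO exactly once (stand-in for tee | gunzip).
cat < "$dir/p1" > "$dir/out.txt" &
reader=$!

# Placeholder writer: as long as fd 3 is open, the reader never sees EOF,
# no matter how many other writers come and go.
exec 3> "$dir/p1"

# Independent writers, each opening and closing p1 on its own.
cat "$dir/xaa" > "$dir/p1"; cat "$dir/xab" > "$dir/p1"; cat "$dir/xac" > "$dir/p1"

# Input is over: drop the placeholder so the reader finally gets EOF.
exec 3>&-
wait "$reader"

result=$(cat "$dir/out.txt")
rm -rf "$dir"
printf '%s\n' "$result"
```

Closing fd 3 plays the role of killing the background job: it is the only point at which the reader is told that the stream has ended.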
Upvotes: 1
Reputation: 531165
I'm not positive, but I think there is a race condition involved. Consider using this as a simpler alternative:
tee >(sha1sum > sha1sum_new.txt) < p1 | gunzip > output_file.txt
and feed p1 with a single command
cat xaa xab xac > p1
This way, you open p1 for writing exactly once, and open it for reading exactly once.
Upvotes: 1