Reputation: 5615
I have seen several demonstrations of node streams that are something like
createReadStream(file)
.pipe(filter(/test/i))
.pipe(count())
demonstrating how you would count the instances of the string "test" inside the file. filter and count are largely handwaved in the code samples, with no implementation shown.
Separately, I have seen it noted several times that you can't count on where a chunk boundary will fall; chunks are arbitrarily sized.
Combining these two, why doesn't the code above have a bug, where a chunk may end right in the middle of the word "test"?
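To make the suspected bug concrete, here is a small self-contained demo (naiveCount is a hypothetical helper, not from the demos I saw) showing how a per-chunk search misses a match that straddles a chunk boundary:

```javascript
// Naive counter: searches each chunk independently, so a match split
// across two chunks is never seen.
function naiveCount(chunks, needle) {
  let count = 0;
  for (const chunk of chunks) {
    let from = 0;
    let at;
    while ((at = chunk.indexOf(needle, from)) !== -1) {
      count++;
      from = at + needle.length;
    }
  }
  return count;
}

// The same text, split two different ways:
naiveCount(['one test two test'], 'test');     // 2 — both occurrences found
naiveCount(['one test two te', 'st'], 'test'); // 1 — the split "test" is missed
```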
Upvotes: 2
Views: 537
Reputation: 707328
You are correct that a generic readstream produces chunks of unknown, arbitrary size, so anything that operates on those chunks (without bugs) has to account for matches that span the boundary between two chunks.
For the filter() operation to match correctly across chunk boundaries, it would have to do some internal buffering, keeping the last part of each chunk so that when the next chunk arrives it can check whether a match spans the boundary.
Without seeing the code for the filter() operation, we have no idea whether it does that (properly handles matches across chunk boundaries). It could, and if written properly it would not have this bug. But if it doesn't do that, or doesn't implement it correctly, then it could indeed have the bug you describe.
Note, there are some stream transforms that deliberately emit chunks with known boundaries, which can make tasks like this easier. For example, a line-reader transform produces a stream that presents one whole line at a time. But a generic readstream from a file has unknown chunk boundaries.
Upvotes: 4