Aaron

Reputation: 434

Memory usage for a parallel stream from File.lines()

I am reading lines from large files (8GB+) using Files.lines(). When processing sequentially it works great, with a very low memory footprint. As soon as I add parallel() to the stream, it seems to hang onto the data it is processing perpetually, eventually causing an out-of-memory exception. I believe this is the result of the Spliterator caching data when trying to split, but I'm not sure. My only remaining idea is to write a custom Spliterator whose trySplit method peels off a small amount of data instead of trying to split the file in half or more. Has anyone else encountered this?
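A minimal sketch of the setup described above, with a placeholder file path and a stand-in filter/count as the per-line work (the real path and processing will differ):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.stream.Stream;

    public class ParallelLinesDemo {
        public static void main(String[] args) throws IOException {
            Path file = Paths.get("big-file.txt"); // placeholder for the 8GB+ input

            // Sequential: lines are consumed lazily, memory stays low.
            try (Stream<String> lines = Files.lines(file)) {
                long count = lines.filter(l -> !l.isEmpty()).count();
                System.out.println("sequential: " + count);
            }

            // Parallel: memory climbs steadily until an OutOfMemoryError is thrown.
            try (Stream<String> lines = Files.lines(file)) {
                long count = lines.parallel().filter(l -> !l.isEmpty()).count();
                System.out.println("parallel: " + count);
            }
        }
    }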

Upvotes: 3

Views: 2649

Answers (2)

Steven

Reputation: 2275

As mentioned by dkatzel, this problem is caused by Spliterators.IteratorSpliterator, which batches the elements in your stream. The batch size starts at 1024 elements and grows up to 33,554,432 elements.

Another solution is to use the FixedBatchSpliteratorBase proposed in the article Faster parallel processing in Java using Streams and a spliterator; a sketch along those lines is shown below.
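For reference, here is a minimal wrapping spliterator in the same spirit. This is only a sketch under my own naming, not the article's actual FixedBatchSpliteratorBase code: its trySplit() always peels off a fixed-size batch from the underlying spliterator instead of letting the batch grow.

    import java.util.Spliterator;
    import java.util.Spliterators;
    import java.util.function.Consumer;

    // Sketch: wraps any Spliterator and splits off fixed-size batches.
    public class FixedBatchSpliterator<T> implements Spliterator<T> {

        private final Spliterator<T> source;
        private final int batchSize;

        public FixedBatchSpliterator(Spliterator<T> source, int batchSize) {
            this.source = source;
            this.batchSize = batchSize;
        }

        @Override
        public Spliterator<T> trySplit() {
            // Copy at most batchSize elements into an array and hand that
            // array out as the prefix split; this wrapper keeps the rest.
            Object[] batch = new Object[batchSize];
            int[] n = {0};
            while (n[0] < batchSize && source.tryAdvance(t -> batch[n[0]++] = t)) {
                // elements accumulate in the array via the lambda above
            }
            return n[0] == 0 ? null
                    : Spliterators.spliterator(batch, 0, n[0], characteristics());
        }

        @Override
        public boolean tryAdvance(Consumer<? super T> action) {
            return source.tryAdvance(action);
        }

        @Override
        public long estimateSize() {
            return source.estimateSize();
        }

        @Override
        public int characteristics() {
            // Drop SIZED/SUBSIZED: the wrapper cannot promise exact sizes.
            return source.characteristics() & ~(SIZED | SUBSIZED);
        }
    }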

Upvotes: 1

dkatzel

Reputation: 31648

Tracing through the code, my guess is that the Spliterator used by Files.lines() is Spliterators.IteratorSpliterator, whose trySplit() method has this comment:

        /*
         * Split into arrays of arithmetically increasing batch
         * sizes.  This will only improve parallel performance if
         * per-element Consumer actions are more costly than
         * transferring them into an array.  The use of an
         * arithmetic progression in split sizes provides overhead
         * vs parallelism bounds that do not particularly favor or
         * penalize cases of lightweight vs heavyweight element
         * operations, across combinations of #elements vs #cores,
         * whether or not either are known.  We generate
         * O(sqrt(#elements)) splits, allowing O(sqrt(#cores))
         * potential speedup.
         */

The code then splits into batches whose sizes grow in increments of 1024 records (lines): the first split reads 1024 lines, the next one 2048 lines, and so on, each split reading a larger batch than the last.

If your file is really big, it will eventually hit the maximum batch size of 33,554,432, which is 1 << 25. Remember, that's lines, not bytes, which will probably cause an out-of-memory error, especially once multiple threads are each buffering that many.
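As a rough back-of-the-envelope illustration (not the JDK source, just the arithmetic described above with an assumed 1024 step and 1 << 25 cap), the buffered line counts add up quickly:

    // Rough illustration of the arithmetic batch progression described above.
    public class BatchGrowthEstimate {
        public static void main(String[] args) {
            final long batchUnit = 1024;      // assumed growth step per split
            final long maxBatch = 1L << 25;   // assumed cap: 33,554,432 elements
            long batch = 0;
            long cumulative = 0;
            for (int split = 1; split <= 10; split++) {
                batch = Math.min(batch + batchUnit, maxBatch);
                cumulative += batch;
                System.out.printf("split %2d: batch=%,d lines, handed out so far=%,d lines%n",
                        split, batch, cumulative);
            }
        }
    }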

That also explains the slowdown: those lines are read ahead of time before any thread can process them.

So I would either not use parallel() at all, or, if you must because the computation you are doing per line is expensive, write your own Spliterator that doesn't split like this. Always using a fixed batch of 1024 is probably fine.
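If you go the custom-Spliterator route, the wiring might look roughly like this. This is a sketch that assumes a fixed-batch wrapping spliterator such as the FixedBatchSpliterator sketched in the other answer, plus a placeholder file path and stand-in per-line work:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.stream.Stream;
    import java.util.stream.StreamSupport;

    public class FixedBatchParallelLines {
        public static void main(String[] args) throws IOException {
            try (Stream<String> lines = Files.lines(Paths.get("big-file.txt"))) {
                // Wrap the stream's spliterator so every split is only 1024 lines,
                // then build a parallel stream on top of the wrapper.
                Stream<String> parallel = StreamSupport.stream(
                        new FixedBatchSpliterator<>(lines.spliterator(), 1024), true);
                long nonEmpty = parallel.filter(l -> !l.isEmpty()).count();
                System.out.println(nonEmpty);
            }
        }
    }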

Upvotes: 5
