Reputation: 194
So, I want to process a file line by line in parallel; since each line is processed independently, the processing doesn't need to be sequential. Also, the file is big, so I don't want to read the whole file into memory up front.
So I was wondering: if I call Java NIO's Files.lines() and then process the resulting stream using n threads, does that mean only n lines will be read into memory and processed at a time?
I was thinking this should be similar to a bucket-processing approach, where one thread reads n lines into a blocking queue while a pool of threads processes those lines in parallel.
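To make that concrete, this is roughly the bucket/blocking-queue approach I have in mind (just an untested sketch; the file name and process() are placeholders):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class BucketProcessing {
    // sentinel object telling a worker that no more lines will arrive
    private static final String EOF = new String("EOF");

    public static void main(String[] args) throws IOException, InterruptedException {
        int workers = Runtime.getRuntime().availableProcessors();
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1024); // bounds memory to ~1024 pending lines
        ExecutorService pool = Executors.newFixedThreadPool(workers);

        for (int i = 0; i < workers; i++) {
            pool.submit(() -> {
                try {
                    // compare by identity, so a file line that happens to read "EOF" is still processed
                    for (String line = queue.take(); line != EOF; line = queue.take()) {
                        process(line); // independent per-line work
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        // a single reader thread (here: main) feeds the queue, blocking whenever it is full
        try (BufferedReader reader = Files.newBufferedReader(Paths.get("big-file.txt"))) {
            for (String line; (line = reader.readLine()) != null; ) {
                queue.put(line);
            }
        }
        for (int i = 0; i < workers; i++) {
            queue.put(EOF); // one sentinel per worker so each one terminates
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    private static void process(String line) {
        // placeholder for the actual per-line processing
    }
}
```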
Upvotes: 0
Views: 503
Reputation: 298183
The most important information first: the stream returned by Files.lines() does not load the entire file into memory.
The amount of data held in memory at the same time is implementation dependent and will unavoidably rise when you turn on parallel processing, but it does not depend on the file size (unless it's a rather small file).
You may simply try whether it works for your case and ignore the details below.
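In other words, the simplest thing to try is something along these lines (the path is just a placeholder):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class ParallelLines {
    public static void main(String[] args) {
        // try-with-resources ensures the underlying reader is closed
        try (Stream<String> lines = Files.lines(Paths.get("big-file.txt"))) {
            lines.parallel()
                 .forEach(line -> {
                     // independent per-line work; must not depend on encounter order
                 });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```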
In the reference implementation of Java 8, Files.lines() simply delegates to BufferedReader.lines(). The only difference is that a close handler is installed, so closing the stream will close the underlying BufferedReader. So there are two buffers involved: one for the source data of the charset conversion and the one inside the BufferedReader used to identify the line boundaries before creating the line strings and passing them to the stream. Since the size of these buffers does not depend on the size of the file, they are nothing to worry about when processing large files.
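To illustrate the delegation, it is roughly equivalent to the following simplified sketch (not the actual JDK source; the real code has additional error handling):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

final class LinesSketch {
    // roughly what Files.lines(Path, Charset) boils down to in the Java 8 reference implementation
    static Stream<String> lines(Path path, Charset cs) throws IOException {
        BufferedReader br = Files.newBufferedReader(path, cs); // holds the two fixed-size buffers
        return br.lines()                  // lazily produces one line at a time
                 .onClose(() -> {          // the extra close handler mentioned above
                     try {
                         br.close();
                     } catch (IOException e) {
                         throw new UncheckedIOException(e);
                     }
                 });
    }
}
```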
But as discussed in "Reader#lines() parallelizes badly due to nonconfigurable batch size policy in its spliterator", Java 8's implementation of this stream is not well suited to parallel processing. It inherits a fallback splitting strategy which buffers at least 1024 elements and raises this chunk size by 1024 on each split operation, up to 33554432 elements per chunk. In the case of a stream of lines, that is the number of lines which may exist in memory at a time, per thread. Depending on what a "large file" means to you, this may boil down to having the entire file in memory in the worst case, or not the entire file but still too much for your use case.
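To illustrate that growth pattern with the numbers above (this is just the arithmetic, not the JDK's code):

```java
// Growth pattern of the fallback splitting strategy described above:
// each split hands off a chunk 1024 elements larger than the previous one,
// capped at 33554432 elements per chunk.
public class ChunkGrowth {
    public static void main(String[] args) {
        long chunk = 0, handedOff = 0;
        for (int split = 1; split <= 10; split++) {
            chunk = Math.min(chunk + 1024, 33_554_432);
            handedOff += chunk;
            System.out.printf("split %2d: chunk = %,8d lines, handed off so far = %,9d lines%n",
                              split, chunk, handedOff);
        }
    }
}
```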
This has been addressed for JDK 9 by JDK-8072773, "(fs) Files.lines needs a better splitting implementation for stream source". But it's worth noting that the request has been implemented quite literally. Instead of taking the opportunity to improve BufferedReader.lines() to implement a Spliterator directly, most of the code has not been touched at all. Instead, Files.lines just got a better splitting implementation and otherwise runs through the same inefficient code paths.
If the preconditions are fulfilled (the Path is on the default filesystem, the charset is one of the supported ones, and the file can be mapped as a single ByteBuffer), a special spliterator is created which memory-maps the file and identifies potential split positions, i.e. line breaks, within that ByteBuffer. But once the initial splitting is done, despite there being a ByteBuffer representing the chunk and knowledge about how to identify line breaks in it, the code creates a new BufferedReader for the chunk, to do the same inefficient decoding and streaming via a wrapped Iterator as before, but now for every chunk of work.
So while this is an improvement, it has the following shortcomings:
- While it is an optimization specifically for large files, it stops being effective with files larger than 2 GiB.
- It only works for the default filesystem and certain charsets (currently UTF-8, ISO-LATIN-1, and US-ASCII).
- If the workload is unbalanced, e.g. a filter matching one half of the file followed by the actual expensive work, so that the initial splitting is not enough to give every CPU core some work, the parallel performance will suffer, as the new spliterators do not support splitting after traversal has started, not even the buffer array variant of BufferedReader.
- While the original approach has a single BufferedReader with the two buffers mentioned above, you now have a BufferedReader for every chunk of work.
Still, as said at the beginning, Files.lines() does not load the entire file into memory. And your use case might benefit from these improvements. But it depends on the details.
Upvotes: 2