modeller

Reputation: 3850

Java 8's Files.lines(): Performance concern for very long lines

Java 8's stream API is convenient and has gained popularity. For file I/O, I found that two APIs are provided to generate stream output: Files.lines(path) and bufferedReader.lines().

I did not find a stream API that provides a Stream of fixed-size buffers for reading files, though.

My concern is: in the case of a file with a very long line, e.g. a 4GB file with only a single line, aren't these line-based APIs very inefficient?

The line-based reader will need at least 4GB of memory to hold that line, whereas a fixed-size buffer reader (fileInputStream.read(byte[] b, int off, int len)) needs at most the buffer size of memory.
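
To illustrate, a rough sketch of the fixed-size buffer approach I have in mind (the file name and buffer size are just placeholders):

import java.io.FileInputStream;
import java.io.IOException;

public class FixedBufferRead {
    public static void main(String[] args) throws IOException {
        byte[] buffer = new byte[8192]; // memory use is bounded by the buffer size
        try (FileInputStream in = new FileInputStream("huge-single-line.txt")) {
            int n;
            while ((n = in.read(buffer, 0, buffer.length)) > 0) {
                // process buffer[0..n) here, without holding the whole line in memory
            }
        }
    }
}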

If the above concern is valid, is there a more efficient Stream API for file I/O?

Upvotes: 0

Views: 2665

Answers (2)

Holger

Reputation: 298203

Which method of delivery is appropriate depends on how you want to process the data. If your processing requires the data line by line, there is no way around reading it that way.

If you really want fixed-size chunks of character data, you can use the following method(s):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Objects;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.function.Consumer;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

public static Stream<String> chunks(Path path, int chunkSize) throws IOException {
    return chunks(path, chunkSize, StandardCharsets.UTF_8);
}
public static Stream<String> chunks(Path path, int chunkSize, Charset cs)
throws IOException {
    Objects.requireNonNull(path);
    Objects.requireNonNull(cs);
    if(chunkSize<=0) throw new IllegalArgumentException();

    CharBuffer cb = CharBuffer.allocate(chunkSize);
    BufferedReader r = Files.newBufferedReader(path, cs);
    return StreamSupport.stream(
        new Spliterators.AbstractSpliterator<String>(
            // estimated number of chunks; elements are ordered and never null
            Files.size(path)/chunkSize, Spliterator.ORDERED|Spliterator.NONNULL) {
            @Override public boolean tryAdvance(Consumer<? super String> action) {
                // fill the buffer until it is full or the end of the file is reached
                try { do {} while(cb.hasRemaining() && r.read(cb)>0); }
                catch (IOException ex) { throw new UncheckedIOException(ex); }
                if(cb.position()==0) return false; // nothing was read: end of stream
                action.accept(cb.flip().toString()); // flip() resets the position for the next refill
                return true;
            }
    }, false).onClose(() -> { // close the reader when the stream is closed
        try { r.close(); } catch(IOException ex) { throw new UncheckedIOException(ex); }
    });
}
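
A usage sketch, assuming a file path and chunk size of your own choosing (the try-with-resources block ensures the onClose handler releases the reader):

try (Stream<String> s = chunks(Paths.get("huge-single-line.txt"), 1 << 16)) {
    s.forEach(chunk -> System.out.println("read a chunk of " + chunk.length() + " chars"));
}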

But I wouldn’t be surprised if your next question were “how can I merge adjacent stream elements?”, as these fixed-size chunks are rarely the natural unit of data for your actual task.

More often than not, the subsequent step is to perform pattern matching within the contents. In that case, it’s better to use Scanner in the first place, which is capable of performing pattern matching while streaming the data. This can be done efficiently because the regex engine tells whether buffering more data could change the outcome of a match operation (see hitEnd() and requireEnd()). Unfortunately, generating a stream of matches from a Scanner has only been added in Java 9, but see this answer for a back-port of that feature to Java 8.
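
For Java 9 and newer, a minimal sketch of that approach (the pattern and file name are just placeholders) could look like this:

import java.io.IOException;
import java.nio.file.Paths;
import java.util.Scanner;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;

public class ScanMatches {
    public static void main(String[] args) throws IOException {
        Pattern word = Pattern.compile("\\w+"); // placeholder pattern
        try (Scanner sc = new Scanner(Paths.get("huge-single-line.txt"), "UTF-8")) {
            sc.findAll(word)                    // Java 9+: returns a Stream<MatchResult>
              .map(MatchResult::group)
              .forEach(System.out::println);
        }
    }
}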

Upvotes: 2

Kayaman

Reputation: 73558

If you have a 4GB text file with a single line, and you're processing it "line by line", then you've made a serious error in your programming by not understanding the data you're working with.

They're convenience methods for when you need to do simple work with data such as CSV or another similar format, where the line sizes are manageable.
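
For instance, a minimal sketch of that kind of use (the file name and column layout are made up):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class CsvLines {
    public static void main(String[] args) throws IOException {
        // every line is short here, so reading line by line is perfectly reasonable
        try (Stream<String> lines = Files.lines(Paths.get("orders.csv"))) {
            lines.map(line -> line.split(","))
                 .forEach(cols -> System.out.println(cols[0]));
        }
    }
}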

A real-life example of a 4GB text file with a single line would be an XML file without line breaks. You would use a streaming XML parser to read that, not roll your own solution that reads it line by line.
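
As a rough illustration of the streaming-parser approach (StAX here; the file name and element name are made up):

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class StreamXml {
    public static void main(String[] args) throws IOException, XMLStreamException {
        try (InputStream in = Files.newInputStream(Paths.get("huge.xml"))) {
            // StAX pulls one event at a time, so memory use stays bounded
            XMLStreamReader reader = XMLInputFactory.newFactory().createXMLStreamReader(in);
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "record".equals(reader.getLocalName())) {
                    // handle one record element at a time
                }
            }
            reader.close();
        }
    }
}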

Upvotes: 5
