modeller

Reputation: 3850

Java 8's Files.lines(): Performance concern for very long lines

Java 8's stream API is convenient and has gained popularity. For file I/O, I found that two APIs are provided to generate stream output: Files.lines(path) and bufferedReader.lines().

I did not find a stream API that provides a Stream of fixed-size buffers for reading files, though.

My concern is: in the case of a file with a very long line, e.g. a 4GB file with only a single line, aren't these line-based APIs very inefficient?

The line-based reader will need at least 4GB of memory to hold that line, whereas a fixed-size buffer reader (fileInputStream.read(byte[] b, int off, int len)) needs at most the buffer size of memory.
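
To illustrate, a rough sketch of the fixed-size buffer approach I have in mind (the file name and buffer size are just placeholders):

import java.io.FileInputStream;
import java.io.IOException;

public class FixedBufferRead {
    public static void main(String[] args) throws IOException {
        byte[] buffer = new byte[8192]; // memory use is bounded by the buffer size
        try (FileInputStream in = new FileInputStream("huge-single-line.txt")) {
            int n;
            while ((n = in.read(buffer, 0, buffer.length)) > 0) {
                // process buffer[0..n) here, without holding the whole line in memory
            }
        }
    }
}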

If the above concern is valid, is there a more efficient Stream API for file I/O?

Upvotes: 0

Views: 2665

Answers (2)

Holger

Reputation: 298203

Which method of delivery is appropriate depends on how you want to process the data. If your processing requires the data line by line, there is no way around reading it that way.

If you really want fixed-size chunks of character data, you can use the following method(s):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Objects;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.function.Consumer;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

public static Stream<String> chunks(Path path, int chunkSize) throws IOException {
    return chunks(path, chunkSize, StandardCharsets.UTF_8);
}
public static Stream<String> chunks(Path path, int chunkSize, Charset cs)
throws IOException {
    Objects.requireNonNull(path);
    Objects.requireNonNull(cs);
    if(chunkSize<=0) throw new IllegalArgumentException();

    CharBuffer cb = CharBuffer.allocate(chunkSize);
    BufferedReader r = Files.newBufferedReader(path, cs);
    return StreamSupport.stream(
        new Spliterators.AbstractSpliterator<String>(
            // estimated number of chunks; elements are ordered and never null
            Files.size(path)/chunkSize, Spliterator.ORDERED|Spliterator.NONNULL) {
            @Override public boolean tryAdvance(Consumer<? super String> action) {
                // fill the buffer until it is full or the end of the file is reached
                try { do {} while(cb.hasRemaining() && r.read(cb)>0); }
                catch (IOException ex) { throw new UncheckedIOException(ex); }
                if(cb.position()==0) return false; // nothing was read: end of stream
                action.accept(cb.flip().toString()); // flip() resets the position for the next refill
                return true;
            }
    }, false).onClose(() -> { // close the reader when the stream is closed
        try { r.close(); } catch(IOException ex) { throw new UncheckedIOException(ex); }
    });
}
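
A usage sketch, assuming a file path and chunk size of your own choosing (the try-with-resources block ensures the onClose handler releases the reader):

try (Stream<String> s = chunks(Paths.get("huge-single-line.txt"), 1 << 16)) {
    s.forEach(chunk -> System.out.println("read a chunk of " + chunk.length() + " chars"));
}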

But I wouldn’t be surprised if your next question were “how can I merge adjacent stream elements?”, as these fixed-size chunks are rarely the natural unit of data for your actual task.

More often than not, the subsequent step is to perform pattern matching within the contents. In that case, it’s better to use Scanner in the first place, which is capable of performing pattern matching while streaming the data. This can be done efficiently because the regex engine tells whether buffering more data could change the outcome of a match operation (see hitEnd() and requireEnd()). Unfortunately, generating a stream of matches from a Scanner has only been added in Java 9, but see this answer for a back-port of that feature to Java 8.
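
For Java 9 and newer, a minimal sketch of that approach (the pattern and file name are just placeholders) could look like this:

import java.io.IOException;
import java.nio.file.Paths;
import java.util.Scanner;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;

public class ScanMatches {
    public static void main(String[] args) throws IOException {
        Pattern word = Pattern.compile("\\w+"); // placeholder pattern
        try (Scanner sc = new Scanner(Paths.get("huge-single-line.txt"), "UTF-8")) {
            sc.findAll(word)                    // Java 9+: returns a Stream<MatchResult>
              .map(MatchResult::group)
              .forEach(System.out::println);
        }
    }
}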

Upvotes: 2

Kayaman

Reputation: 73558

If you have a 4GB text file with a single line, and you're processing it "line by line", then you've made a serious error in your programming by not understanding the data you're working with.

They're convenience methods for when you need to do simple work with data such as CSV or another similar format, where the line sizes are manageable.
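
For instance, a minimal sketch of that kind of use (the file name and column layout are made up):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class CsvLines {
    public static void main(String[] args) throws IOException {
        // every line is short here, so reading line by line is perfectly reasonable
        try (Stream<String> lines = Files.lines(Paths.get("orders.csv"))) {
            lines.map(line -> line.split(","))
                 .forEach(cols -> System.out.println(cols[0]));
        }
    }
}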

A real-life example of a 4GB text file with a single line would be an XML file without line breaks. You would use a streaming XML parser to read that, not roll your own solution that reads it line by line.
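
As a rough illustration of the streaming-parser approach (StAX here; the file name and element name are made up):

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class StreamXml {
    public static void main(String[] args) throws IOException, XMLStreamException {
        try (InputStream in = Files.newInputStream(Paths.get("huge.xml"))) {
            // StAX pulls one event at a time, so memory use stays bounded
            XMLStreamReader reader = XMLInputFactory.newFactory().createXMLStreamReader(in);
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "record".equals(reader.getLocalName())) {
                    // handle one record element at a time
                }
            }
            reader.close();
        }
    }
}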

Upvotes: 5
