Reputation: 2078
I have a Java application that currently parses an input file line by line in a loop and, for each line, writes (through a specific API) a line to an output file.
The order of the written lines is critical (the lines are timestamped). Given that, I've chosen to execute the whole task in the main thread, but the performance is terrible. The only way I know to maximize performance is to use multiple threads, but because of the ordering requirement I don't see how to apply that here. Then again, I'm not an expert in parallel execution; maybe there is a way to use it even in this case. Is there?
P.S.: 75% of the time is spent on writes, so the bottleneck is not in file parsing.
P.P.S.: the application must run on a local machine.
Upvotes: 0
Views: 282
Reputation: 12019
If you've found that most of the execution time is spent writing the output, that's already a good indication of where the biggest speed gain lies. You had the right reflex: measure before trying to optimise.
The first step is to make sure the FileWriter (or FileOutputStream, whichever you use) is wrapped in a BufferedWriter or BufferedOutputStream with a large enough buffer. This lets Java collect output in memory and only flush it to the file when the buffer fills up. The amount of output doesn't change, but it gets distributed over fewer I/O calls.
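As a minimal sketch of that wrapping (the file names and the 1 MiB buffer size are just placeholders for this example):

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

public class BufferedCopy {
    public static void main(String[] args) throws IOException {
        // "input.txt"/"output.txt" stand in for your real files
        try (BufferedReader in = new BufferedReader(new FileReader("input.txt"));
             BufferedWriter out = new BufferedWriter(new FileWriter("output.txt"), 1 << 20)) {
            String line;
            while ((line = in.readLine()) != null) {
                // per-line parsing would happen here
                out.write(line);
                out.newLine(); // accumulates in the 1 MiB buffer, rarely hits the OS
            }
        } // try-with-resources flushes the buffer and closes both files
    }
}
```

Because flushing happens automatically on close, the output order is exactly the order of the `write` calls; buffering changes only when the bytes reach the OS, not their sequence.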
If that doesn't do it, look into tutorials on the classes in the java.nio package. This API was introduced with Java 1.4, and an extension called NIO.2, providing file system capabilities, was added in Java SE 7. These provide non-blocking I/O. The idea behind non-blocking I/O is that in traditional I/O operations threads tend to spend a lot of time waiting for the underlying OS and hardware to complete reads and writes, performing no useful work in the meantime. With non-blocking I/O you place output into a buffer and have it written out asynchronously: the write call returns immediately, and your thread can continue doing useful work while the system calls complete the transfer. This is different from a regular BufferedWriter or BufferedOutputStream, which provides an in-memory buffer but still blocks on writing that buffer out once it gets flushed.
Using non-blocking I/O lets your application fetch more data from the input and/or process it while the output is being written, giving you some parallelism. However, if the output side is enough of a bottleneck that reading and processing always "catch up" with the writing and overwhelm the output channel's buffer, the output will still be the limiting factor. After all, in the end all the output must be written to a file.
A method for performing parallel output while still keeping the output in a predictable order is to use a memory-mapped file. You'd use java.io.RandomAccessFile for this, which can be combined with java.nio for asynchronous writing as well. You could then write to different parts of the file in parallel. The disadvantage is that each part of your output must have a known, fixed length. Apart from some very specific use cases (like fixed-length text records or some binary formats), this is usually not how things go.
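A sketch of the fixed-length-record idea, assuming a made-up 32-byte record size and placeholder data. Since record i always starts at offset i * RECORD_LEN, each slot could be filled from a different thread (each using its own map.duplicate() to avoid sharing buffer positions):

```java
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;

public class MappedRecords {
    static final int RECORD_LEN = 32; // fixed record length (assumption for this sketch)

    public static void main(String[] args) throws Exception {
        String[] records = { "rec-0", "rec-1", "rec-2" }; // placeholder data
        try (RandomAccessFile raf = new RandomAccessFile("records.dat", "rw");
             FileChannel ch = raf.getChannel()) {
            MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_WRITE,
                    0, (long) records.length * RECORD_LEN);
            for (int i = 0; i < records.length; i++) {
                byte[] bytes = records[i].getBytes(StandardCharsets.US_ASCII);
                map.position(i * RECORD_LEN); // every record has a known offset
                map.put(bytes);               // rest of the slot stays zero-padded
            }
            map.force(); // flush the mapped pages out to disk
        }
    }
}
```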
Finally, it is feasible to process the input in parallel and still write it out in the correct order, regardless of which parts of the input were finished first. You just need to queue the output with some metadata identifying its order (for example by wrapping it in a small helper class) and have the writer emit nothing out of order. Some libraries may offer something useful here, but a priority queue of objects wrapping the output with a sequence number can suffice. In enterprise integration patterns this is known as a resequencer.
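A minimal sketch of such a resequencer (the class and method names are invented for this example; a List stands in for the real writer). Sequence numbers are assigned when lines are read; workers hand back results in any order, and the queue releases them strictly in sequence:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class Resequencer {
    // Output wrapped with the sequence number assigned when the line was read
    record Seq(long seq, String line) {}

    private final PriorityQueue<Seq> pending =
            new PriorityQueue<>(Comparator.comparingLong(Seq::seq));
    private long nextToWrite = 0;
    final List<String> written = new ArrayList<>(); // stands in for the real writer

    // Called by worker threads as they finish processing, in any order
    synchronized void submit(long seq, String line) {
        pending.add(new Seq(seq, line));
        // Drain everything that is now contiguous with what was already written
        while (!pending.isEmpty() && pending.peek().seq() == nextToWrite) {
            written.add(pending.poll().line());
            nextToWrite++;
        }
    }

    public static void main(String[] args) {
        Resequencer r = new Resequencer();
        // Results arrive out of order, as they would from a thread pool...
        r.submit(2, "line 2");
        r.submit(0, "line 0");
        r.submit(1, "line 1");
        // ...but reach the writer strictly in sequence order
        System.out.println(r.written); // prints [line 0, line 1, line 2]
    }
}
```

Note that "line 2" sits in the queue until lines 0 and 1 have been written, so the output file order always matches the input order.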
Upvotes: 2