Reputation: 5596
Below is the code where I try to process lines read from a file with a parallel stream and with a normal stream. Surprisingly, the parallel stream gives no improvement over the normal stream. Am I missing something here?
Files.walk(Paths.get(tweetFilePath + LocalDate.now())).forEach(filePath -> {
    if (Files.isRegularFile(filePath) && !filePath.toString().endsWith(".DS_Store")) {
        long startTime = System.currentTimeMillis();
        try {
            Files.lines(filePath).parallel().forEach(line -> {
                try {
                    System.out.println(line);
                } catch (Exception e) {
                    System.out.println("Not able to crunch" + e);
                }
            });
        } catch (Exception e) {
            System.out.println("Bad line in file ");
        } finally {
            System.out.println("total time required:" + (System.currentTimeMillis() - startTime));
        }
    }
});
Upvotes: 1
Views: 712
Reputation: 100209
The first problem is that Files.lines parallelizes badly, especially for files shorter than 1024 lines. Check this question for details. If you know in advance that your file is short enough to fit into memory, it would be better to read it sequentially into a List first:
Files.readAllLines(filePath, StandardCharsets.UTF_8).parallelStream()...
I have some ideas on how to improve this, but it's still not an ideal solution. The fact is that Stream API parallelization is quite ineffective if you cannot even estimate the element count of the input stream.
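Fleshing that snippet out, here is a minimal sketch of the read-then-parallelize approach. It assumes the file fits in memory; processLine is a hypothetical placeholder for whatever per-line work you actually do:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

static void crunchFile(Path filePath) throws IOException {
    // Read the whole file sequentially first. The resulting list has a
    // known size, so the fork/join machinery can split it evenly.
    List<String> lines = Files.readAllLines(filePath, StandardCharsets.UTF_8);
    lines.parallelStream()
         .forEach(line -> processLine(line)); // processLine: your per-line work (hypothetical)
}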
The second problem is your forEach operation. Here you just use System.out, so all the threads will try to write to the same PrintStream, fighting over the same resource, and most of the time will be spent waiting for the lock to be released. Internally it uses a BufferedWriter where all writes are synchronized. You can benefit from parallelization only if you don't use shared resources in the parallel operations.
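For example, instead of printing from every worker thread, you could do the expensive work in parallel and funnel the output through a single sequential step afterwards. This is only a sketch; transform stands in for some hypothetical CPU-bound per-line computation:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;

static void crunchAndPrint(Path filePath) throws IOException {
    // Heavy work runs in parallel; the shared PrintStream is touched
    // only once the parallel part has finished, so threads never
    // contend for the System.out lock.
    List<String> results = Files.readAllLines(filePath, StandardCharsets.UTF_8)
            .parallelStream()
            .map(line -> transform(line)) // transform: hypothetical CPU-bound work
            .collect(Collectors.toList());
    results.forEach(System.out::println); // sequential, uncontended output
}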
By the way, Files.lines creates a stream over a BufferedReader. It's better to manage it with a try-with-resources statement. Otherwise the files will be closed only when the underlying FileInputStream objects are garbage-collected, so you may sporadically get errors like "too many open files".
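A sketch of the same Files.lines call wrapped in try-with-resources, so the reader is closed deterministically:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

static void printLines(Path filePath) throws IOException {
    // Stream implements AutoCloseable: closing it closes the underlying
    // BufferedReader (and the file handle) as soon as the block exits.
    try (Stream<String> lines = Files.lines(filePath)) {
        lines.forEach(System.out::println);
    }
}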
Upvotes: 1
Reputation: 9714
It looks like, currently, Files.lines reads the file linearly, so the parallel call cannot split the source stream into sub-streams for parallel processing.
See here for details. Relevant section quoted below:
What if my source is based on IO?
Currently, JDK IO-based Stream sources (for example BufferedReader.lines()) are mainly geared for sequential use, processing elements one-by-one as they arrive. Opportunities exist for supporting highly efficient bulk processing of buffered IO, but these currently require custom development of Stream sources, Spliterators, and/or Collectors. Some common forms may be supported in future JDK releases.
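For what it's worth, one shape such a custom source could take is a spliterator that hands the fork/join machinery batches of lines instead of single elements. This is only a sketch: batchedLines is a hypothetical helper, not a JDK API. The batching comes from Spliterators.AbstractSpliterator, whose default trySplit copies elements into arrays in chunks:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.function.Consumer;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

static Stream<String> batchedLines(Path path) throws IOException {
    BufferedReader reader = Files.newBufferedReader(path);
    Spliterator<String> source = reader.lines().spliterator();
    // AbstractSpliterator's trySplit reads a batch of elements into an
    // array and splits that off, giving the parallel stream real chunks
    // to distribute even though the file is read linearly.
    Spliterator<String> batching =
        new Spliterators.AbstractSpliterator<String>(
                Long.MAX_VALUE, Spliterator.ORDERED | Spliterator.NONNULL) {
            @Override
            public boolean tryAdvance(Consumer<? super String> action) {
                return source.tryAdvance(action); // delegate one line at a time
            }
        };
    return StreamSupport.stream(batching, true)
        .onClose(() -> {
            try {
                reader.close(); // release the file handle when the stream is closed
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
}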
Upvotes: 1