Sagar

Reputation: 5596

Parallel stream creates only one thread and gives the result as fast as a normal stream

Below is the code where I try to process lines read from a file, both with a parallel stream and with a normal stream. Surprisingly, the parallel stream gives no improvement over the normal stream. Am I missing something here?

Files.walk(Paths.get(tweetFilePath + LocalDate.now())).forEach(filePath -> {
    if (Files.isRegularFile(filePath) && !filePath.toString().endsWith(".DS_Store")) {
        long startTime = System.currentTimeMillis();
        try {
            Files.lines(filePath).parallel().forEach(line -> {
                try {
                    System.out.println(line);
                } catch (Exception e) {
                    System.out.println("Not able to crunch" + e);
                }
            });
        } catch (Exception e) {
            System.out.println("Bad line in file ");
        } finally {
            System.out.println("total time required:" + (System.currentTimeMillis() - startTime));
        }
    }
});

Upvotes: 1

Views: 712

Answers (2)

Tagir Valeev

Reputation: 100209

The first problem is that Files.lines parallelizes badly, especially for files shorter than 1024 lines. Check this question for details. If you know in advance that your file is short enough to fit into memory, it would be better to read it sequentially into a List first:

Files.readAllLines(filePath, StandardCharsets.UTF_8).parallelStream()...

I have some ideas on how to improve this, but it's still not an ideal solution. The fact is that Stream API parallelization is quite ineffective if you cannot even estimate the element count of the input stream.
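
For instance, a minimal sketch of that approach. The file name and the filter predicate here are hypothetical stand-ins for your real per-line work:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

public class ParallelLines {
    public static void main(String[] args) throws IOException {
        Path filePath = Paths.get("tweets.txt"); // hypothetical input file

        // Read sequentially into a List: now the stream knows its size
        // and can split cleanly across worker threads.
        List<String> lines = Files.readAllLines(filePath, StandardCharsets.UTF_8);

        long matches = lines.parallelStream()
                .filter(line -> line.contains("error")) // stand-in for real per-line work
                .count();

        System.out.println("Matching lines: " + matches);
    }
}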

The second problem is your forEach operation. Here you just use System.out, so all the threads try to write to the same PrintStream, fighting over the same resource; thus most of the time is spent waiting for the lock to be released. Internally it uses a BufferedWriter where all writes are synchronized. You may benefit from parallelization only if you don't use shared resources inside the parallel operations.
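
A hedged sketch of that idea, where crunch is a hypothetical CPU-bound transformation: do the work in parallel into a collection, then print once, sequentially:

import java.util.List;
import java.util.stream.Collectors;

class NoSharedState {
    // Hypothetical CPU-bound per-line work.
    static String crunch(String line) {
        return line.toUpperCase();
    }

    static void process(List<String> lines) {
        // The parallel part touches no shared resource...
        List<String> results = lines.parallelStream()
                .map(NoSharedState::crunch)
                .collect(Collectors.toList());

        // ...and the synchronized PrintStream is hit only once, sequentially.
        results.forEach(System.out::println);
    }
}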

By the way, Files.lines creates a stream over a BufferedReader. It's better to manage it with a try-with-resources statement. Otherwise the files will be closed only when the underlying FileInputStream objects are garbage-collected, so you may sporadically hit errors like "too many open files".
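
A minimal sketch of the try-with-resources form (printAll is just an illustrative wrapper):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

class LinesWithResources {
    static void printAll(Path filePath) throws IOException {
        // The stream (and the BufferedReader behind it) is closed
        // when the try block exits, even on exception.
        try (Stream<String> lines = Files.lines(filePath)) {
            lines.forEach(System.out::println);
        }
    }
}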

Upvotes: 1

Aishwar

Reputation: 9714

It looks like, currently, Files.lines reads the file linearly, so the parallel call cannot split the source stream into sub-streams for parallel processing.

See here for details. The relevant section is quoted below:

What if my source is based on IO?

Currently, JDK IO-based Stream sources (for example BufferedReader.lines()) are mainly geared for sequential use, processing elements one-by-one as they arrive. Opportunities exist for supporting highly efficient bulk processing of buffered IO, but these currently require custom development of Stream sources, Spliterators, and/or Collectors. Some common forms may be supported in future JDK releases.
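
One way to check how much splitting actually happens is to record which threads execute the terminal operation. This is illustrative instrumentation, not code from the question, and the file name is a hypothetical stand-in:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Stream;

class ThreadProbe {
    public static void main(String[] args) throws IOException {
        Path filePath = Paths.get("tweets.txt"); // hypothetical input
        Set<String> threads = ConcurrentHashMap.newKeySet();

        try (Stream<String> lines = Files.lines(filePath)) {
            lines.parallel().forEach(line -> threads.add(Thread.currentThread().getName()));
        }

        // For a short file, this often prints just [main]: the source
        // did not split, so only one thread did all the work.
        System.out.println(threads);
    }
}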

Upvotes: 1
