I am applying an operation to every element in a very large LinkedList<LinkedList<Double>> : list.stream().map(l -> l.stream().filter(d -> (Collections.max(l) - d) < 5) .collect(Collectors.toCollection(LinkedList::new))).collect(Collectors.toCollection(LinkedList::new)); On my computer (quad-core), parallel streams seem to be faster than using sequential streams: list.parallelStream().map(l -> l.parallelStream().filter(d -> (Collections.max(l) - d) < 5) .collect(Collectors.toCollection(LinkedList::new))).collect(Collectors.toCollection(LinkedList::new)); However, not every computer is going to be multi-core. My question is, will using parallel streams on a single-processor computer be noticeably slower than using sequential streams?

Will using a parallel stream on a single-core processor be slower than using a sequential stream?

Answers (3)

Reputation: 171

I did three benchmarking tests, one testing Holger's suggested optimizations, one using parallel and sequential streams on my quad-core computer (Asus FX550IU-WSFX), without optimizations, and one using parallel and sequential streams on a single core computer (Dell Optiplex 170L), also without optimizations. The lists for every test will contain 1.25 million elements.

Benchmarking code:

long average = 0;
for(int i = 0; i < 100; i++) {
    long start = System.nanoTime();
    //testing code...
    average += (System.nanoTime() - start);
}

System.out.println((average / 100) / 1000000 + "ms average");

Testing Optimizations (on a 4-core processor)

Un-optimized code:

List<List<Double>> result = list.parallelStream().map(l -> l.parallelStream().filter(d -> 
    (Collections.max(l) - d) < 5)
        .collect(Collectors.toCollection(LinkedList::new)))
            .collect(Collectors.toCollection(LinkedList::new));

Optimized code:

List<List<Double>> result =
    list.parallelStream()
        .map(l -> {
                double limit = Collections.max(l)-5;
                return l.parallelStream()
                        .filter(d -> limit < d)
                        .collect(Collectors.toList());
            })
        .collect(Collectors.toList());

Times:

Using the un-optimized code, the average execution time was 633ms, while using the optimized code, the average execution time was 25ms.

Testing Un-optimized Code on 4-core Processor

Sequential code:

List<List<Double>> result = list.stream().map(l -> l.stream().filter(d -> 
    (Collections.max(l) - d) < 5)
        .collect(Collectors.toCollection(LinkedList::new)))
            .collect(Collectors.toCollection(LinkedList::new));

Parallel code:

List<List<Double>> result = list.parallelStream().map(l -> l.parallelStream().filter(d -> 
        (Collections.max(l) - d) < 5)
            .collect(Collectors.toCollection(LinkedList::new)))
                .collect(Collectors.toCollection(LinkedList::new));

Times:

Using the sequential code, the average execution time was 879ms, while using the parallel code yields an average execution time of 539ms.

Testing Un-optimized Code on 1-core Processor

Sequential code:

List<List<Double>> result = list.stream().map(l -> l.stream().filter(d -> 
    (Collections.max(l) - d) < 5)
        .collect(Collectors.toCollection(LinkedList::new)))
            .collect(Collectors.toCollection(LinkedList::new));

Parallel code:

List<List<Double>> result = list.parallelStream().map(l -> l.parallelStream().filter(d -> 
        (Collections.max(l) - d) < 5)
            .collect(Collectors.toCollection(LinkedList::new)))
                .collect(Collectors.toCollection(LinkedList::new));

Times:

Using the sequential code, the average execution time was 2398ms, while using the parallel code yields an average execution time of 3942ms.

Conclusion

While using parallel streams on a single-core processor, and sequential streams on a four-core processor, does seem to be slower, optimizing the code resulted in the fastest execution times.

Upvotes: 0

Holger

Reputation: 298143

This is highly implementation specific, but usually, a parallel stream will go through a different code path for most operations, which implies performing additional work, but at the same time, the thread pool will be configured to the number of CPU cores.

E.g., if you run the following program

System.setProperty("java.util.concurrent.ForkJoinPool.common.parallelism", "1");
System.out.println("Parallelism: "+ForkJoinPool.getCommonPoolParallelism());
Set<Thread> threads = ConcurrentHashMap.newKeySet();
for(int run=0; run<2; run++) {
    IntStream stream = IntStream.range(0, 100);
    if(run==1) {
        stream = stream.parallel();
        System.out.println("Parallel:");
    }
    int chunks = stream
        .mapToObj(i->Thread.currentThread())
        .collect(()->new int[]{1}, (a,t)->threads.add(t), (a,b)->a[0]+=b[0])[0];
    System.out.println("processed "+chunks+" chunk(s) with "+threads.size()+" thread(s)");
}

it will print something like

Parallelism: 1
processed 1 chunk(s) with 1 thread(s)
Parallel:
processed 4 chunk(s) with 1 thread(s)

You can see the effect of splitting the workload, whereas splitting to four times the configured parallelism is not a coincidence, but also that only one thread is involved, so there is no inter-thread communication happening here. Whether the JVM’s optimizer will detect the single-thread nature of this operation and elide synchronization costs in this case, is, like anything else, an implementation detail.

All in all, the overhead is not very big and doesn’t scale with the actual amount of work, so if the actual work is big enough to benefit from parallel processing on SMP machines, the fraction of the overhead will be negligible on single core machines.

But if you care for performance, you should also look at the other aspects of your code.

By repeating an operation like Collections.max(l) for every element of l, you are combining two linear operations to an operation with quadratic time complexity. It’s easy to perform this operation only once instead:

List<List<Double>> result =
    list.parallelStream()
        .map(l -> {
                double limit = Collections.max(l)-5;
                return l.parallelStream()
                        .filter(d -> limit < d)
                        .collect(Collectors.toCollection(LinkedList::new));
            })
        .collect(Collectors.toCollection(LinkedList::new));

Depending on the list sizes, the impact of this little change, turning a quadratic operation to linear, might be far bigger than dividing the processing time by just the number of cpu cores (in the best case).

The other consideration is whether you really need a LinkedList. For most practical purposes, a LinkedList performs worse than, e.g. an ArrayList, and if you don’t need mutability, you may just use the toList() collector and let the JRE return the best list it can offer…

List<List<Double>> result =
    list.parallelStream()
        .map(l -> {
                double limit = Collections.max(l)-5;
                return l.parallelStream()
                        .filter(d -> limit < d)
                        .collect(Collectors.toList());
            })
        .collect(Collectors.toList());

Keep in mind that after changing the performance characteristics, rechecking whether the parallelization still has any benefit is recommended. It should also be checked for both stream operations individually. Usually, if the outer stream has a decent parallelization, turning the inner stream to parallel does not improve the overall performance.

Also, the benefit of parallel streams will be much higher if the source lists are random access lists instead of LinkedLists.

Upvotes: 7

tsolakp

Reputation: 5948

Soon we won't be having single core CPU any more. But if you are curious how threading works on single non hyper threaded core then see this answer: Why is threading works on a single core CPU?

So to answer your question the run time most likely will be better for sequential processing since it won't be involving thread starting, scheduling and synchronization.

Upvotes: 0

Will using a parallel stream on a single-core processor be slower than using a sequential stream?

Answers (3)

Related Questions