Wei-jia Huang

Reputation: 11

Efficiency of ProcessPoolExecutor versus ThreadPoolExecutor

I want to sum a list efficiently by parallelizing the additions while preserving the order of the elements. The idea is a tree-like reduction: adjacent elements are added pairwise, level by level, until a single value remains. I have tried both ProcessPoolExecutor and ThreadPoolExecutor, looking for the best performance on a machine with 8 CPUs.

import concurrent.futures
import time

add_list = list(range(100))
temp_len = len(add_list)


loop_start = time.time()
max_node_num = 0

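while temp_len > 1:
    is_odd = temp_len % 2

    # Nodes on the next level: one per pair, plus the unpaired last element (if any).
    group_len = temp_len // 2 + is_odd
    # Index pairs of adjacent elements to be added together.
    group = [(2*i, 2*i+1) for i in range(group_len - is_odd)]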

    if is_odd == 1:
        last = add_list[-1]

    def group_cont(group):
        add_item = add_list[group[0]] + add_list[group[1]]

        return add_item

    with concurrent.futures.ThreadPoolExecutor() as executor:
        results = executor.map(group_cont, group)

    # with concurrent.futures.ProcessPoolExecutor() as executor:
    #     results = executor.map(group_cont, group)


    # Collect this level's sums in order; they are the input for the next level.
    add_list = list(results)
    if is_odd == 1:
        add_list.append(last)
    temp_len = group_len

loop_end = time.time() 
print(f'Time: {round(loop_end-loop_start,3)}')

Output with ThreadPoolExecutor: Time: 0.012

Output with ProcessPoolExecutor: Time: 0.271

Why is ThreadPoolExecutor so much more efficient than ProcessPoolExecutor? How can I make ProcessPoolExecutor more efficient?

Upvotes: 0

Views: 5531

Answers (2)

Jonas B.

Reputation: 1

You are using the executor.map function. For ProcessPoolExecutor.map the default is chunksize=1; for ThreadPoolExecutor.map the parameter is ignored. Increasing chunksize sends several items to a worker process per task instead of one, which should speed up the ProcessPoolExecutor version.

See: https://docs.python.org/3/library/concurrent.futures.html "When using ProcessPoolExecutor, this method chops iterables into a number of chunks which it submits to the pool as separate tasks. The (approximate) size of these chunks can be specified by setting chunksize to a positive integer. For very long iterables, using a large value for chunksize can significantly improve performance compared to the default size of 1"
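A minimal sketch of what that could look like, assuming the reduction is restructured so that each task receives the two values rather than indices into a global list. The names add_pair and tree_sum are made up for this illustration, and chunksize=16 is an arbitrary choice, not a tuned value. The sketch also reuses one pool for all levels instead of creating a new pool per level.

```python
import concurrent.futures


def add_pair(pair):
    """Add the two values in a (left, right) tuple."""
    left, right = pair
    return left + right


def tree_sum(values, executor, chunksize=1):
    """Sum `values` by pairwise reduction, one executor.map call per tree level."""
    values = list(values)
    while len(values) > 1:
        # Pair up adjacent elements; an unpaired last element is carried over.
        pairs = [(values[i], values[i + 1]) for i in range(0, len(values) - 1, 2)]
        carried = [values[-1]] if len(values) % 2 else []
        # chunksize only has an effect with ProcessPoolExecutor.
        values = list(executor.map(add_pair, pairs, chunksize=chunksize)) + carried
    return values[0] if values else 0


if __name__ == "__main__":
    data = list(range(100))
    # One pool for the whole reduction, larger chunks per task.
    with concurrent.futures.ProcessPoolExecutor() as executor:
        print(tree_sum(data, executor, chunksize=16))
```

With only 100 integers the per-level overhead will probably still dominate, so treat this as a way to try out chunksize rather than a guaranteed speedup.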

Upvotes: -1

Stephen C

Reputation: 719386

Why is ThreadPoolExecutor so much more efficient than ProcessPoolExecutor?

As explained in the comments, this is because the overheads of creating processes and copying data between processes in the latter exceed the corresponding overheads in the former, apparently by a factor of 20 or so.

And this is despite the fact that the threads in the ThreadPoolExecutor version are most likely only using one core at a time because of the GIL.

Note that the data is not transferred via disk files. It is actually transferred via a pipe (in current implementations). Even so, the overhead of serializing, writing, reading and deserializing a small object is orders of magnitude larger than that of passing a reference from one thread to another.
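To get a rough feel for the serialization part of that cost, here is a small sketch that compares handing an object to a function by reference with a pickle round trip of the same object. The helper names are made up for this illustration. It leaves out the pipe write and read entirely, so it only shows a lower bound on the per-task cost in a process pool.

```python
import pickle
import timeit

task = (3, 4)  # a small task argument, similar to one index pair


def pass_reference(obj):
    """Hand the object over as-is, as happens between threads."""
    return obj


def pickle_round_trip(obj):
    """Serialize and deserialize the object, as happens between processes."""
    return pickle.loads(pickle.dumps(obj))


if __name__ == "__main__":
    n = 100_000
    t_ref = timeit.timeit(lambda: pass_reference(task), number=n)
    t_pickle = timeit.timeit(lambda: pickle_round_trip(task), number=n)
    print(f"reference: {t_ref:.4f} s, pickle round trip: {t_pickle:.4f} s")
```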

How can I make ProcessPoolExecutor more efficient?

In this example, you most likely can't.

The real problem is that this is a bad example for testing parallelism [1]. The work in each parallel task is minuscule. Even under ideal circumstances, the overhead of dispatching each task to another thread or process and getting the result back far exceeds any possible speedup from (ideally) using multiple cores.

On the other hand, if the work performed by each task were a couple of orders of magnitude larger, you would probably find that the ProcessPoolExecutor version was faster.
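A sketch of such a test, assuming a deliberately CPU-heavy task. The function names and the limit of 50_000 are arbitrary choices for this illustration. On a standard CPython build with the GIL and several cores I would expect the process pool to come out ahead here, but the actual timings depend on the machine, so none are shown.

```python
import concurrent.futures
import time


def count_primes_below(limit):
    """Count primes below `limit` by trial division (deliberately CPU-heavy)."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def timed_map(executor_cls, limits):
    """Run count_primes_below over `limits`; return (results, elapsed seconds)."""
    start = time.perf_counter()
    with executor_cls() as executor:
        results = list(executor.map(count_primes_below, limits))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    limits = [50_000] * 8  # eight sizeable tasks, matching the 8 CPUs in the question
    for cls in (concurrent.futures.ThreadPoolExecutor,
                concurrent.futures.ProcessPoolExecutor):
        _, elapsed = timed_map(cls, limits)
        print(f"{cls.__name__}: {elapsed:.3f} s")
```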


1 - The characterization of your test code as "insanely inefficient" is apt, IMO.

Upvotes: 1
