abhiraj gupta

Reputation: 33

Why Multiprocessing is taking more time than sequential processing?

The code below takes around 15 seconds to produce the result, but when I run it sequentially it only takes around 11 seconds. What can be the reason for this?

import multiprocessing
import os
import time
def square(x):
    # print(os.getpid())
    return x*x

if __name__=='__main__':

    start_time = time.time()
    p = multiprocessing.Pool()
    r = range(100000000)
    p1 = p.map(square,r)
    end_time = time.time()
    print('time_taken::',end_time-start_time)
    

Sequential code

start_time = time.time()
d = list(map(square, range(100000000)))
end_time = time.time()
print('time_taken::', end_time - start_time)

Upvotes: 0

Views: 1238

Answers (1)

jgru

Reputation: 191

Regarding your code example, there are two important factors that influence the runtime gains achievable by parallelization:

First, you have to take the administrative overhead into account: spawning new processes is rather expensive compared to simple arithmetic operations. You therefore only gain performance when the computation's complexity exceeds a certain threshold, which is not the case in your example above.
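For instance, here is a minimal sketch (the function heavy and its parameters are purely illustrative) where the per-item work is expensive enough that Pool.map beats the sequential map despite the spawning overhead:

import time
from multiprocessing import Pool


def heavy(x):
    # Enough arithmetic per call to dwarf the process-spawning and IPC cost
    return sum(i * i for i in range(x % 1000 + 1000))


if __name__ == "__main__":
    data = range(20_000)

    start = time.time()
    with Pool() as p:
        p.map(heavy, data)
    print("parallel:  ", time.time() - start)

    start = time.time()
    list(map(heavy, data))
    print("sequential:", time.time() - start)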

Secondly, you have to think of a clever way of splitting your computation into parts that can be executed independently. In the given code example, you can optimize the chunks you pass to the worker processes created by multiprocessing.Pool, so that each process has a self-contained package of computations to perform.

E.g., this could be accomplished with the following modifications to your code:

import math
from multiprocessing import Pool


def square(x):
    return x ** 2


def square_chunk(i, j):
    """Squares all numbers in the half-open range [i, j)"""
    return list(map(square, range(i, j)))


def calculate_in_parallel(n, c=4):
    """Calculates a list of squares in a parallelized manner"""
    result = []
    # Each of the c workers gets a contiguous block of roughly n / c numbers
    step = math.ceil(n / c)

    with Pool(c) as p:
        partial_results = p.starmap(
            square_chunk, [(i, min(i + step, n)) for i in range(0, n, step)]
        )

        # Concatenate the ordered partial results into one list
        for res in partial_results:
            result += res

    return result

Please note that I used the operation x**2 (instead of the heavily optimized x*x) to increase the per-call load and underline the resulting runtime differences.

Here, the Pool's starmap() function is used, which unpacks the arguments of the passed tuples, so we can effectively pass more than one argument to the mapped function. Furthermore, the workload is distributed evenly across the available cores: each worker computes the squares for the range (i, min(i + step, n)), where step denotes the chunk size, calculated as the maximum number n divided by the process count c.
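As a usage sketch (assuming the definitions above; the value of n and all timings are machine-dependent), the chunked version can be compared against the sequential baseline like this:

import time

if __name__ == "__main__":
    n = 10_000_000

    start = time.time()
    parallel_result = calculate_in_parallel(n, c=4)
    print("parallel:  ", time.time() - start)

    start = time.time()
    sequential_result = list(map(square, range(n)))
    print("sequential:", time.time() - start)

    # Both approaches must produce the same list of squares
    assert parallel_result == sequential_result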

By running the code with different parametrizations, one can clearly see that the performance gain grows as the maximum number n increases. As expected, the runtime also drops when more cores are used in parallel.

[Figure: comparison of runtime performance for varying n and core counts]

Edit: As @KellyBundy pointed out, parallelism especially shines when you minimize not only the input to the worker processes but the output as well. Performing several measurements that calculate the sum of the squared numbers (sum(map(square, range(i, j)))) instead of returning (and concatenating) lists showed an even larger improvement in runtime performance, as the following figure illustrates.

[Figure: runtime comparison when returning sums instead of lists]
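A sketch of that sum-returning variant (the names sum_square_chunk and calculate_sum_in_parallel are mine, introduced for illustration) could look like this:

import math
from multiprocessing import Pool


def square(x):
    return x ** 2


def sum_square_chunk(i, j):
    # Each worker sends back a single integer instead of a whole list
    return sum(map(square, range(i, j)))


def calculate_sum_in_parallel(n, c=4):
    """Calculates the sum of squares below n in a parallelized manner"""
    step = math.ceil(n / c)
    with Pool(c) as p:
        partial_sums = p.starmap(
            sum_square_chunk, [(i, min(i + step, n)) for i in range(0, n, step)]
        )
    return sum(partial_sums)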

Upvotes: 6
