Kyle Powell

Reputation: 31

Multiprocessing in Python. Why is there no speed-up?

I am trying to get to grips with multiprocessing in Python. I started by writing the code below. It simply computes cos(i) for integers i and measures the time taken with and without multiprocessing. I am not observing any time difference. Here is my code:

    import multiprocessing
    import numpy as np
    import time


    def tester(num):
        return np.cos(num)


    if __name__ == '__main__':

        # Timed run using a pool of worker processes
        starttime1 = time.time()
        pool_size = multiprocessing.cpu_count()
        pool = multiprocessing.Pool(processes=pool_size)
        pool_outputs = pool.map(tester, range(5000000))
        pool.close()
        pool.join()
        endtime1 = time.time()
        timetaken = endtime1 - starttime1

        # Timed run using a plain serial loop
        starttime2 = time.time()
        for i in range(5000000):
            tester(i)
        endtime2 = time.time()
        timetaken2 = timetaken = endtime2 - starttime2

        print('The time taken with multiple processes:', timetaken)
        print('The time taken the usual way:', timetaken2)
I am observing no (or very minimal) difference between the two times measured. I am using a machine with 8 cores, so this is surprising. What have I done incorrectly in my code?

Note that I learned all of this from this tutorial: http://pymotw.com/2/multiprocessing/communication.html

I understand that "joblib" might be more convenient for an example like this, but the code that this ultimately needs to be applied to does not work with "joblib".

Upvotes: 3

Views: 9991

Answers (3)

CoMartel

Reputation: 3591

First, you wrote:

    timetaken2 = timetaken = endtime2 - starttime2

So it is normal that both displayed times are the same: the second assignment overwrites timetaken. But this is not the important part.

I ran your code on my computer (i7, 4 cores), and I got:

('The time taken with multiple processes:', 14.95710802078247)
('The time taken the usual way:', 6.465447902679443)

The multiprocessed loop is slower than the plain for loop. Why?

The multiprocessing module sidesteps the Python Global Interpreter Lock by using separate processes, but because each process has its own address space, memory is not shared between them. So when you launch a Pool, you need to copy the useful variables to each worker, process your calculation, and retrieve the result. This costs you a little time for every task and makes you less efficient.

But that only hurts here because you do a very small computation: multiprocessing is only useful for larger calculations, when copying the memory and retrieving the results is cheaper (in time) than the calculation itself.

I tried with the following tester, which is much more expensive, over 2000 runs:

    import numpy as np

    def expenser_tester(num):
        A = np.random.rand(10 * num)    # create a random 1D array
        for k in range(0, len(A) - 1):  # some useless but costly operation
            A[k + 1] = A[k] * A[k + 1]
        return A

('The time taken with multiple processes:', 4.030329942703247)
('The time taken the usual way:', 8.180987119674683)

You can see that for an expensive calculation, multiprocessing is more efficient, even if you don't always get the speedup you might expect (on 4 cores I could have hoped for a 4x speedup, but I only got 2x). Keep in mind that Pool has to duplicate every bit of memory used in the calculation, so it may be memory-expensive.

If you really want to improve a small calculation like your example, make each task bigger by grouping values and sending a list of variables to each worker instead of one variable per task, as sketched below.
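A minimal sketch of that grouping idea, using the cos workload from the question; the helper name tester_batch and the batch count of 500 are illustrative, not from the original answer:

    import multiprocessing
    import numpy as np

    def tester_batch(nums):
        # One task now handles a whole batch, so the pickling/IPC
        # overhead is paid once per batch instead of once per number.
        return np.cos(np.asarray(nums))

    if __name__ == '__main__':
        batches = np.array_split(np.arange(5000000), 500)  # 500 batches
        pool = multiprocessing.Pool(processes=multiprocessing.cpu_count())
        results = pool.map(tester_batch, batches)
        pool.close()
        pool.join()

Note that pool.map also accepts a chunksize argument, which batches the dispatch of tasks for you and often gives a similar win without changing the worker function.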

You should also know that numpy and scipy have a lot of expensive functions written in C/Fortran that are already optimized (and sometimes parallelized), so you often cannot do much to speed them up.

Upvotes: 5

Marc Cayuela

Reputation: 1592

If the problem is CPU-bound then you should see the expected speed-up (provided the operation is long enough and the overhead is not significant). But with multiprocessing (because memory is not shared between processes) it's easier to end up with a memory-bound problem.

Upvotes: 0

6502

Reputation: 114569

Your job seems to be the computation of a single cos value. The cost of that is basically unnoticeable compared to the time spent communicating with the worker process.

Try making 5 computations of 1000000 cos values each and you should see them run in parallel.
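A minimal sketch of that suggestion (the helper name cos_block is illustrative): each task receives only a start offset and computes 1000000 cos values in one vectorized call, so almost nothing has to be shipped between processes:

    import multiprocessing
    import numpy as np

    def cos_block(start):
        # One big vectorized computation per task: 1,000,000 cos values
        return np.cos(np.arange(start, start + 1000000))

    if __name__ == '__main__':
        pool = multiprocessing.Pool(processes=5)
        # 5 large tasks instead of 5,000,000 tiny ones
        results = pool.map(cos_block, range(0, 5000000, 1000000))
        pool.close()
        pool.join()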

Upvotes: 8
