QtizedQ

Reputation: 310

Numpy Python Parallelization Scaling Identically to Serial Computation

I am trying to demonstrate a speedup by parallelizing a toy piece of code in Python before I attempt to parallelize my main program. My approach is to perform four matrix inversions in parallel, then perform the same inversion once and multiply its runtime by four to estimate how long the problem would take in serial. I understand that parallelizing adds some constant overhead, but as a first-order approximation this should be fine for my purposes.

The issue I am running into is that the parallel computation runs so slowly that it appears each of the four inversions is simply being executed serially.

I have included my code for comparison. Note that I have tried two different approaches, one using just multiprocessing and one using joblib (commented out), and both give me comparable runtimes. mp.cpu_count() reports 20 CPUs available, so I am confident that this is not an issue with my computer.

from joblib import Parallel, delayed
import multiprocessing as mp
import numpy as np
import time

if __name__ == '__main__':
    A = np.random.random((10000,10000))
    tA = [A,A,A,A]
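    print(mp.cpu_count())

    # Parallel: four inversions spread over four worker processes
    pool = mp.Pool(4)
    t = time.time()
    # results = Parallel(n_jobs=4)(delayed(np.linalg.inv)(i) for i in tA)
    results = pool.map(np.linalg.inv, tA)
    print(time.time() - t)
    pool.close()
    pool.join()

    # Serial estimate: one inversion, scaled by the number of matrices
    t = time.time()
    temp = np.linalg.inv(A)
    print((time.time()-t)*len(tA))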

For reference, here is the output when I run this code (CPU count, parallel runtime in seconds, estimated serial runtime in seconds):

20
62.151000229
52.516007477

Upvotes: 0

Views: 256

Answers (1)

sascha

Reputation: 33522

Check your CPU load while running the code without process-based parallelization. I would not be surprised to see that your non-parallel code is already using threads heavily (and transparently). CPU load alone is not proof, but it is a strong indication.
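If watching a system monitor is inconvenient, a rough alternative is to compare the CPU time consumed by the process with the wall-clock time of a single inversion. The following is only a sketch: it assumes Python 3 (for time.process_time and time.perf_counter), and the helper name cpu_and_wall_time and the matrix size are made up for illustration. Because process CPU time is summed over all threads of the process, a CPU time clearly larger than the wall time suggests that more than one thread was busy.

```python
import time

import numpy as np


def cpu_and_wall_time(func, *args):
    """Run func(*args) once; return (cpu_seconds, wall_seconds).

    time.process_time() sums user and system CPU time over all threads
    of the current process, so cpu_seconds > wall_seconds hints at
    multithreaded execution inside func.
    """
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    func(*args)
    cpu_seconds = time.process_time() - cpu_start
    wall_seconds = time.perf_counter() - wall_start
    return cpu_seconds, wall_seconds


if __name__ == "__main__":
    # Hypothetical size, chosen to be smaller than the 10000 in the question.
    A = np.random.random((2000, 2000))
    cpu_seconds, wall_seconds = cpu_and_wall_time(np.linalg.inv, A)
    print("CPU time: ", cpu_seconds)
    print("Wall time:", wall_seconds)
    print("Ratio:    ", cpu_seconds / wall_seconds)
```

A ratio close to 1 points to a single-threaded inversion; a ratio well above 1 points to a threaded one.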

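If your system is set up correctly, NumPy will call into BLAS/LAPACK (e.g. dgetrf for the LU factorization), and most implementations are heavily multithreaded inside their C or Fortran code. If that is the case here, your single np.linalg.inv call is already using several cores, so multiplying its runtime by four does not give a true serial baseline. That is obviously a problem for your benchmark.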
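One possible way to make the comparison fairer is to pin the BLAS library to a single thread and then time both variants. This is a minimal sketch, not a definitive fix: it assumes Python 3 and assumes that your BLAS backend honours at least one of the environment variables OMP_NUM_THREADS, OPENBLAS_NUM_THREADS or MKL_NUM_THREADS, which depends on how NumPy was built. The helper names time_serial and time_parallel and the matrix size are made up for illustration.

```python
import os

# Assumption: the BLAS backend honours at least one of these variables.
# They have to be set before NumPy is imported for the first time.
for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS"):
    os.environ[var] = "1"

import multiprocessing as mp
import time

import numpy as np


def time_serial(matrices):
    """Invert the matrices one after another; return (seconds, results)."""
    start = time.perf_counter()
    results = [np.linalg.inv(m) for m in matrices]
    return time.perf_counter() - start, results


def time_parallel(matrices, workers):
    """Invert the matrices in a process pool; return (seconds, results)."""
    with mp.Pool(workers) as pool:
        start = time.perf_counter()
        results = pool.map(np.linalg.inv, matrices)
        elapsed = time.perf_counter() - start
    return elapsed, results


if __name__ == "__main__":
    n = 1000  # hypothetical size, much smaller than the 10000 in the question
    matrices = [np.random.random((n, n)) for _ in range(4)]
    serial_seconds, _ = time_serial(matrices)
    parallel_seconds, _ = time_parallel(matrices, workers=4)
    print("serial, 1 BLAS thread:     ", serial_seconds)
    print("4 processes, 1 thread each:", parallel_seconds)
```

With the thread count pinned, the serial run should no longer borrow extra cores, so any gain from the process pool becomes visible. The parallel figure still includes the cost of pickling the matrices to and from the workers.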

Upvotes: 2
