irritable_phd_syndrome

Reputation: 5067

Why do numpy arrays take longer to process than lists with multiprocessing library?

I am running on a machine with two AMD 7302 16-core processors (a total of 32 cores). I'm on a Red Hat 8.4 system, using Python 3.10.6.

I've recently started learning the multiprocessing library. Inspired by the first example on the documentation page, I wrote my own little piece of code:

from multiprocessing import Pool
import numpy as np
import sys
import datetime

def f(x):
    return x**2

def main(DataType="List", NThr=2, Vectorize=False):
    N = 5*10**7           # number of elements
    n = NThr              # number of worker processes (the Pool spawns processes, not threads)
    y = np.zeros(N)
    # Use list
    if(DataType == "List"):
        x = []
        for i in range(N):
            x.append(i)
    # Use Numpy
    elif(DataType=="Numpy"):
        x = np.zeros(N)
        for i in range(len(x)):
            x[i] = i
    # Run parallel code
    t0 = datetime.datetime.now()
    if(n==1):
        if(DataType == "Numpy" and Vectorize == True):
            y = np.vectorize(f)(x)
        else:
            for i in range(len(x)):
                y[i] = f(x[i])
    else:
        with Pool(n) as p:
            y = p.map(f, x)
    t1 = datetime.datetime.now()
    dt = (t1 - t0).total_seconds()
    print("{} : Vect = {}, n = {}, time : {}s".format(DataType,Vectorize,n,dt))
    sys.exit(0)

if __name__ == "__main__":
    main()

I noticed that when I try to run p.map() over a numpy array, it performs substantially worse. Here is the output from several runs (python mycode.py), each after twiddling the arguments to main:

Numpy : Vect = True, n = 1, time : 9.566441s
Numpy : Vect = False, n = 1, time : 16.00333s
Numpy : Vect = False, n = 2, time : 143.331352s
List : Vect = False, n = 1, time : 21.11657s
List : Vect = False, n = 2, time : 11.868897s
List : Vect = False, n = 5, time : 6.162561s
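(Each line above came from a separate run after editing the arguments passed to main; inferred from the printed output, the calls were presumably along these lines:)

main(DataType="Numpy", NThr=1, Vectorize=True)
main(DataType="Numpy", NThr=1, Vectorize=False)
main(DataType="Numpy", NThr=2, Vectorize=False)
main(DataType="List", NThr=1, Vectorize=False)
main(DataType="List", NThr=2, Vectorize=False)
main(DataType="List", NThr=5, Vectorize=False)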

Look at the (Numpy, n=2) run at 143s. Its run time is substantially worse than the (List, n=2) run at 11.9s. It is also much worse than either of the (Numpy, n=1) runs.

Question :

What makes numpy arrays take so long to run with the multiprocessing library, specifically when NThr==2?

EDIT :

Per a comment's suggestion, I ran both the (Numpy, n=2) and (List, n=2) versions through the profiler:

>>> import cProfile                                                                                                                                                 
>>> from mycode import main                                                                                                                           
>>> cProfile.run('main()')
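(Since the defaults of main are DataType="List" and NThr=2, the Numpy profile presumably came from an edited call along the lines of cProfile.run('main(DataType="Numpy", NThr=2)').)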

I then compared the two profiles side by side. The most time-consuming function calls, and the calls whose counts differ between the two versions, are listed below.

For the Numpy version:

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
# Time consuming
1    0.000    0.000  138.997  138.997 pool.py:362(map)
1    0.000    0.000  138.956  138.956 pool.py:764(wait)
1    0.000    0.000  138.956  138.956 pool.py:767(get)
4    0.000    0.000  138.957   34.739 threading.py:288(wait)
4    0.000    0.000  138.957   34.739 threading.py:589(wait)
14/1    0.000    0.000  145.150  145.150 {built-in method builtins.exec}
19  138.957    7.314  138.957    7.314 {method 'acquire' of '_thread.lock' objects}
# Different number of calls
6    0.000    0.000    0.088    0.015 popen_fork.py:24(poll)
1    0.000    0.000    0.088    0.088 popen_fork.py:36(wait)
1    0.000    0.000    0.088    0.088 process.py:142(join)
10    0.000    0.000    0.000    0.000 process.py:99(_check_closed)
18    0.000    0.000    0.000    0.000 util.py:48(debug)
76    0.000    0.000    0.000    0.000 {built-in method builtins.len}
2    0.000    0.000    0.000    0.000 {built-in method numpy.zeros}
17    0.000    0.000    0.000    0.000 {built-in method posix.getpid}
6    0.088    0.015    0.088    0.015 {built-in method posix.waitpid}
3    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}

For the List version:

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
# Time consuming
1    0.000    0.000   13.961   13.961 pool.py:362(map)
1    0.000    0.000   13.920   13.920 pool.py:764(wait)
1    0.000    0.000   13.920   13.920 pool.py:767(get)
4    0.000    0.000   13.921    3.480 threading.py:288(wait)
4    0.000    0.000   13.921    3.480 threading.py:589(wait)
14/1    0.000    0.000   24.475   24.475 {built-in method builtins.exec}
19   13.921    0.733   13.921    0.733 {method 'acquire' of '_thread.lock' objects}
# Different number of calls
7    0.000    0.000    0.132    0.019 popen_fork.py:24(poll)
2    0.000    0.000    0.132    0.066 popen_fork.py:36(wait)
2    0.000    0.000    0.132    0.066 process.py:142(join)
12    0.000    0.000    0.000    0.000 process.py:99(_check_closed)
19    0.000    0.000    0.000    0.000 util.py:48(debug)
75    0.000    0.000    0.000    0.000 {built-in method builtins.len}
1    0.000    0.000    0.000    0.000 {built-in method numpy.zeros}
18    0.000    0.000    0.000    0.000 {built-in method posix.getpid}
7    0.132    0.019    0.132    0.019 {built-in method posix.waitpid}
50000003    2.780    0.000    2.780    0.000 {method 'append' of 'list' objects}

Note that the List version makes 50000003 calls to append(), compared to 3 calls in the Numpy version, due to the initialization of x.

Upvotes: 3

Views: 393

Answers (1)

Sam Mason

Reputation: 16174

There's so much going on in the question it's hard to know where to start! The main issues are:

  1. Your function f isn't "doing" anything computationally intensive
  2. When using multiprocessing.Pool.map(fn, objs), objs is iterated over, so a Numpy array gets broken up into individual Numpy scalar objects, each of which has to be pickled, sent to a worker process, and unpickled

Put these together and you're basically just benchmarking the pickle module, which for your case amounts to benchmarking pickle.loads(pickle.dumps(list(y))) and results in the terrible performance you see.
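A quick check (a two-line aside of mine, not part of the benchmark below) shows what that iteration actually yields:

import numpy as np

x = np.arange(5)
print(type(x[0]))  # <class 'numpy.int64'> -- a Numpy scalar, not a plain int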

To explain in a bit more detail, I'll write some code:

from time import perf_counter
from multiprocessing import Pool
import pickle

import numpy as np

class CtxTimer:
    "Context manager to time execution of code and save runtime for later"
    def __init__(self, message):
        self.message = message

    def __enter__(self):
        self.start_time = perf_counter()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.end_time = perf_counter()
        print(
            f"{self.message:15} "
            f"time={self.wall*1000:.2f}ms"
        )

    @property
    def wall(self):
        return self.end_time - self.start_time

def f(x):
    return x**2

# bit shorter than your example, but don't want to wait that long
objs = list(range(10**6))
# direct conversion to int64 numpy array (your code was using floats)
objs_np = np.array(objs)

with CtxTimer("Naive Python") as py_naive:
    list(map(f, objs))

with Pool(4) as pool:
    with CtxTimer("MP Pool Python") as py_pool:
        pool.map(f, objs)

    with CtxTimer("MP Pool Numpy") as np_pool:
        pool.map(f, objs_np)

with CtxTimer("Pickle Python") as py_dumps:
    buf = pickle.dumps(objs)

print(f"  {len(buf)/1e6 = :.2f}")

with CtxTimer("Unpickle Python") as py_loads:
    pickle.loads(buf)

with CtxTimer("Pickle Numpy") as np_dumps:
    buf = pickle.dumps(list(objs_np))

print(f"  {len(buf)/1e6 = :.2f}")

with CtxTimer("Unpickle Numpy") as np_loads:
    pickle.loads(buf)

On my deliberately underclocked laptop and limiting to 4 subprocesses, this gives:

  1. The Naive Python block takes ~550ms. So in 1 second, we can evaluate ~3 million function calls, with each just squaring an integer.

  2. Pickle Python takes ~30ms, and Unpickle Python takes ~60ms. These operations are going to be happening every time you use the Pool to transfer objects between processes. Note that this produces a ~5MB object.

  3. Pickle Numpy takes ~3700ms and Unpickle Numpy takes ~450ms. The difference is because pickle special-cases a few common datatypes, like list and int, which are exactly what we happen to be benchmarking. Because this is instead a list of scalar Numpy values, you hit the slow path (see the size check after this list).

  4. MP Pool Python takes ~300ms and MP Pool Numpy takes ~4300ms.
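To make the fast path vs slow path concrete, here's a small size check (my sketch, on a much smaller list than the benchmark):

import pickle
import numpy as np

py_ints = list(range(1000))      # plain Python ints: pickle's fast path
np_ints = list(np.arange(1000))  # Numpy int64 scalars: the slow path

print(len(pickle.dumps(py_ints)))  # a few KB
print(len(pickle.dumps(np_ints)))  # roughly an order of magnitude larger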

To explain these numbers, you need to think about what is happening behind the scenes in the pool. First, there's the parent process that spawns the four child processes. This has to send all the work to the children, which involves doing pickle.dumps on everything and sending the results to the child processes. It then waits for the results to come back and decodes them using pickle.loads. This seems to be the limiting factor, especially so for the Numpy case. Each child process gets a quarter of the items, decodes them, processes them, then encodes the results to send back.

Calculating this for the Numpy variant I get:

max(np_dumps.wall + np_loads.wall, (np_dumps.wall + py_naive.wall + np_loads.wall) / 4)

outputting ~4.2s; note that I'm using max because we have to wait for the slowest part of the system. This agrees pretty well with my observed ~4300ms, though the same calculation for the Python variant isn't as close.
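If you do need process-based parallelism with Numpy data, the usual way to sidestep this overhead is to ship a few large arrays rather than millions of scalars, since whole arrays pickle via a fast binary path. A sketch of that approach (my addition, not part of the benchmark above):

from multiprocessing import Pool

import numpy as np

def f_chunk(chunk):
    # vectorized square of a whole sub-array inside the worker
    return chunk ** 2

if __name__ == "__main__":
    objs_np = np.arange(10**6)
    with Pool(4) as pool:
        # four large messages instead of a million tiny ones
        chunks = np.array_split(objs_np, 4)
        result = np.concatenate(pool.map(f_chunk, chunks))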

Just to point out why Numpy vectorisation should be preferred, the following code:

with CtxTimer("Naive Numpy") as np_naive:
    f(objs_np)

runs in ~2ms, i.e. more than 200x faster than the naive Python implementation. (Note that np.vectorize, as used in the question, does not get you this speedup: under the hood it is still a Python-level loop.) It's worth getting this speedup before dealing with the awkwardness of Python-imposed process-based parallelism, which could give you a maximum 32x speedup doing pure-Python things, or 6400x when doing more optimal things.

Upvotes: 3
