Reputation: 5067
I am running on a machine with two AMD 7302 16-core processors (32 cores in total). I'm on a Red Hat 8.4 system and using Python 3.10.6.
I've recently started learning the multiprocessing library. Inspired by the first example on the documentation page, I wrote my own little piece of code:
from multiprocessing import Pool
import numpy as np
import sys
import datetime

def f(x):
    return x**2

def main(DataType="List", NThr=2, Vectorize=False):
    N = 5*10**7  # number of elements
    n = NThr     # number of worker processes in the pool
    y = np.zeros(N)
    # Use list
    if(DataType == "List"):
        x = []
        for i in range(N):
            x.append(i)
    # Use Numpy
    elif(DataType == "Numpy"):
        x = np.zeros(N)
        for i in range(len(x)):
            x[i] = i
    # Run parallel code
    t0 = datetime.datetime.now()
    if(n == 1):
        if(DataType == "Numpy" and Vectorize == True):
            y = np.vectorize(f)(x)
        else:
            for i in range(len(x)):
                y[i] = f(x[i])
    else:
        with Pool(n) as p:
            y = p.map(f, x)
    t1 = datetime.datetime.now()
    dt = (t1 - t0).total_seconds()
    print("{} : Vect = {}, n = {}, time : {}s".format(DataType, Vectorize, n, dt))
    sys.exit(0)

if __name__ == "__main__":
    main()
I noticed that when I try to run p.map() over a numpy array, it performs substantially worse. Here is the output from several runs (python mycode.py) after twiddling the args to main:
Numpy : Vect = True, n = 1, time : 9.566441s
Numpy : Vect = False, n = 1, time : 16.00333s
Numpy : Vect = False, n = 2, time : 143.331352s
List : Vect = False, n = 1, time : 21.11657s
List : Vect = False, n = 2, time : 11.868897s
List : Vect = False, n = 5, time : 6.162561s
Look at the (Numpy, n=2) run at 143s. Its runtime is substantially worse than the (List, n=2) run at 11.9s, and also much worse than either of the (Numpy, n=1) runs.
Question: What makes numpy arrays take so long to run with the multiprocessing library, specifically when NThr==2?
EDIT: Per a comment's suggestion, I ran both versions, (Numpy, n=2) and (List, n=2), through the profiler:
>>> import cProfile
>>> from mycode import main
>>> cProfile.run('main()')
and compared them side by side. The most time-consuming function calls, and the calls whose counts differ between the two versions, are listed below.
For Numpy version :
ncalls tottime percall cumtime percall filename:lineno(function)
# Time consuming
1 0.000 0.000 138.997 138.997 pool.py:362(map)
1 0.000 0.000 138.956 138.956 pool.py:764(wait)
1 0.000 0.000 138.956 138.956 pool.py:767(get)
4 0.000 0.000 138.957 34.739 threading.py:288(wait)
4 0.000 0.000 138.957 34.739 threading.py:589(wait)
14/1 0.000 0.000 145.150 145.150 {built-in method builtins.exec}
19 138.957 7.314 138.957 7.314 {method 'acquire' of '_thread.lock' objects}
# Different number of calls
6 0.000 0.000 0.088 0.015 popen_fork.py:24(poll)
1 0.000 0.000 0.088 0.088 popen_fork.py:36(wait)
1 0.000 0.000 0.088 0.088 process.py:142(join)
10 0.000 0.000 0.000 0.000 process.py:99(_check_closed)
18 0.000 0.000 0.000 0.000 util.py:48(debug)
76 0.000 0.000 0.000 0.000 {built-in method builtins.len}
2 0.000 0.000 0.000 0.000 {built-in method numpy.zeros}
17 0.000 0.000 0.000 0.000 {built-in method posix.getpid}
6 0.088 0.015 0.088 0.015 {built-in method posix.waitpid}
3 0.000 0.000 0.000 0.000 {method 'append' of 'list' objects}
For List version :
ncalls tottime percall cumtime percall filename:lineno(function)
# Time consuming
1 0.000 0.000 13.961 13.961 pool.py:362(map)
1 0.000 0.000 13.920 13.920 pool.py:764(wait)
1 0.000 0.000 13.920 13.920 pool.py:767(get)
4 0.000 0.000 13.921 3.480 threading.py:288(wait)
4 0.000 0.000 13.921 3.480 threading.py:589(wait)
14/1 0.000 0.000 24.475 24.475 {built-in method builtins.exec}
19 13.921 0.733 13.921 0.733 {method 'acquire' of '_thread.lock' objects}
# Different number of calls
7 0.000 0.000 0.132 0.019 popen_fork.py:24(poll)
2 0.000 0.000 0.132 0.066 popen_fork.py:36(wait)
2 0.000 0.000 0.132 0.066 process.py:142(join)
12 0.000 0.000 0.000 0.000 process.py:99(_check_closed)
19 0.000 0.000 0.000 0.000 util.py:48(debug)
75 0.000 0.000 0.000 0.000 {built-in method builtins.len}
1 0.000 0.000 0.000 0.000 {built-in method numpy.zeros}
18 0.000 0.000 0.000 0.000 {built-in method posix.getpid}
7 0.132 0.019 0.132 0.019 {built-in method posix.waitpid}
50000003 2.780 0.000 2.780 0.000 {method 'append' of 'list' objects}
Note that the List version makes 50000003 calls to append() (due to the initialization of x), compared to 3 calls to append() in the Numpy version.
Upvotes: 3
Views: 393
Reputation: 16174
There's so much going on in the question it's hard to know where to start! The main issues are:

- f isn't "doing" anything computationally intensive
- when you call multiprocessing.Pool.map(fn, objs), objs is iterated over, and every item (and every result) is pickled to be shuttled between the parent and the worker processes

Put these together and you're basically just benchmarking the pickle module, which for your case means benchmarking pickle.loads(pickle.dumps(list(y))), and that results in the terrible performance you see.
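You can see the second point directly: iterating a numpy array hands the pool numpy scalar objects, not plain Python floats. A quick check (using a small array in place of the question's x):

import numpy as np

x = np.zeros(3)
# Pool.map iterates the array, so the workers receive numpy scalars,
# each of which must be pickled individually
print(type(x[0]))  # <class 'numpy.float64'>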
To explain in a bit more detail, I'll write some code:
from time import perf_counter
from multiprocessing import Pool
import pickle
import numpy as np

class CtxTimer:
    "Context manager to time execution of code and save runtime for later"

    def __init__(self, message):
        self.message = message

    def __enter__(self):
        self.start_time = perf_counter()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.end_time = perf_counter()
        print(
            f"{self.message:15} "
            f"time={self.wall*1000:.2f}ms"
        )

    @property
    def wall(self):
        return self.end_time - self.start_time

def f(x):
    return x**2

# bit shorter than your example, but don't want to wait that long
objs = list(range(10**6))
# direct conversion to int64 numpy array (your code was using floats)
objs_np = np.array(objs)

with CtxTimer("Naive Python") as py_naive:
    list(map(f, objs))

with Pool(4) as pool:
    with CtxTimer("MP Pool Python") as py_pool:
        pool.map(f, objs)
    with CtxTimer("MP Pool Numpy") as np_pool:
        pool.map(f, objs_np)

with CtxTimer("Pickle Python") as py_dumps:
    buf = pickle.dumps(objs)
    print(f" {len(buf)/1e6 = :.2f}")

with CtxTimer("Unpickle Python") as py_loads:
    pickle.loads(buf)

with CtxTimer("Pickle Numpy") as np_dumps:
    buf = pickle.dumps(list(objs_np))
    print(f" {len(buf)/1e6 = :.2f}")

with CtxTimer("Unpickle Numpy") as np_loads:
    pickle.loads(buf)
On my deliberately underclocked laptop, and limiting the pool to 4 subprocesses, this gives:

- The Naive Python block takes ~550ms, so in 1 second we can evaluate ~3 million function calls, each just squaring an integer.
- Pickle Python takes ~30ms and Unpickle Python takes ~60ms. These operations happen every time you use the Pool to transfer objects between processes. Note that this produces a ~5MB buffer.
- Pickle Numpy takes ~3700ms and Unpickle Numpy takes ~450ms. The difference is because pickle special-cases a few common datatypes, like list and int, that we happen to be benchmarking here. Because this is just a list of scalar Numpy values, you hit the slow path.
- MP Pool Python takes ~300ms and MP Pool Numpy takes ~4300ms.
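The special-casing is easy to see at the level of a single element: a plain int pickles via a dedicated opcode, while a numpy scalar goes through the generic reduce protocol (exact sizes vary slightly with the pickle protocol version):

import pickle
import numpy as np

# a plain int has a dedicated fast path in pickle
print(len(pickle.dumps(7)))            # a handful of bytes
# a numpy scalar uses the generic object machinery and is much bulkier
print(len(pickle.dumps(np.int64(7))))  # several dozen bytes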
To explain these numbers, you need to think about what is happening behind the scenes in the pool. First there's the parent process, which spawns the four child processes. It has to send all the work to the children, which involves calling pickle.dumps on everything and pushing the buffers to the child processes. It then waits for the results to come back and decodes them with pickle.loads. This seems to be the limiting factor, especially so in the Numpy case. Each child process gets a quarter of the items, decodes them, processes them, then encodes the results to send back.
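As a rough mental model, one task round-trip looks something like this (a deliberately simplified sketch that ignores chunk sizing, task ordering, and the queue plumbing the real pool uses):

import pickle

def f(x):  # same toy function as above
    return x**2

def simulate_round_trip(chunk):
    # parent -> child: the work is serialised and sent over a pipe
    sent = pickle.dumps(chunk)
    # child: decode its share, compute, encode the results
    done = pickle.dumps([f(x) for x in pickle.loads(sent)])
    # child -> parent: results come back and are decoded
    return pickle.loads(done)

assert simulate_round_trip([1, 2, 3]) == [1, 4, 9]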
Calculating this for the Numpy variant, I get:

max(np_dumps.wall + np_loads.wall, (np_dumps.wall + py_naive.wall + np_loads.wall) / 4)

which outputs ~4.2s. Note that I'm using max because we have to wait for the slowest part of the system. This agrees pretty well with my observed ~4300ms, but the same calculation for the Python variant isn't so close.
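For reference, the analogous estimate for the Python variant, using the same timer names from the listing above, would be:

max(py_dumps.wall + py_loads.wall, (py_dumps.wall + py_naive.wall + py_loads.wall) / 4)

With the timings quoted above this comes out around 160ms, noticeably below the observed ~300ms, which suggests fixed per-task overheads account for a larger share of the Python variant's runtime.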
Just to point out why Numpy vectorisation should be preferred, the following code:

with CtxTimer("Naive Numpy") as np_naive:
    f(objs_np)

runs in ~2ms, i.e. more than 200x faster than the naive Python implementation. It's worth getting this speedup before dealing with the awkwardness of Python's process-based parallelism, which could give you a maximum 32x speedup when doing pure-Python things, or 6400x when doing more optimal things.
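As an aside, if you really do need the pool with numpy data, one workaround (my suggestion, not benchmarked above) is to convert the array to plain Python objects first so pickle's fast path applies, reusing CtxTimer, f, Pool and objs_np from the listing above:

# hypothetical variant: .tolist() yields plain Python ints, which hit
# pickle's fast path, at the cost of materialising the list up front
with Pool(4) as pool:
    with CtxTimer("MP Pool tolist") as py_tolist:
        pool.map(f, objs_np.tolist())

Though as the numbers above show, for work this cheap plain vectorisation will still beat any pool-based approach.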
Upvotes: 3