Here is a test of multiprocessing.Pool vs multiprocessing.pool.ThreadPool vs a sequential version. Why is the multiprocessing.pool.ThreadPool version slower than the sequential version?
Is it true that multiprocessing.Pool is faster because it uses processes (i.e. without the GIL), while multiprocessing.pool.ThreadPool uses threads (i.e. with the GIL), despite the package being named multiprocessing?
import time

def test_1(job_list):
    from multiprocessing import Pool
    print('-' * 60)
    print("Pool map")
    start = time.time()
    p = Pool(8)
    s = sum(p.map(sum, job_list))
    print('time:', time.time() - start)

def test_2(job_list):
    print('-' * 60)
    print("Sequential map")
    start = time.time()
    s = sum(map(sum, job_list))
    print('time:', time.time() - start)

def test_3(job_list):
    from multiprocessing.pool import ThreadPool
    print('-' * 60)
    print("ThreadPool map")
    start = time.time()
    p = ThreadPool(8)
    s = sum(p.map(sum, job_list))
    print('time:', time.time() - start)

if __name__ == '__main__':
    job_list = [range(10000000)] * 128
    test_1(job_list)
    test_2(job_list)
    test_3(job_list)
Output:
------------------------------------------------------------
Pool map
time: 3.4112906455993652
------------------------------------------------------------
Sequential map
time: 23.626681804656982
------------------------------------------------------------
ThreadPool map
time: 76.83279991149902
Your tasks are purely CPU bound (no blocking on I/O) and don't use any extension code that manually releases the GIL to do large amounts of number-crunching without involving Python-level reference-counted objects (e.g. hashlib hashing large inputs, large-array numpy computations, etc.). As such, by definition the GIL prevents you from extracting any parallelism from this code: only one thread can hold the GIL and execute Python bytecode at a time, so you pay all the thread-pool overhead (dispatching tasks through queues, switching between threads, handing the GIL back and forth) while still doing the actual work one bytecode at a time, which is why it comes out slower than the plain sequential version.
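For contrast, here is a minimal sketch of a workload where ThreadPool genuinely helps (the 64 MiB chunk size and 8 workers are arbitrary choices for illustration, not taken from the question): hashlib releases the GIL while digesting large buffers, so the worker threads really do run in parallel.

import hashlib
import time
from multiprocessing.pool import ThreadPool

def hash_chunk(chunk):
    # hashlib releases the GIL while hashing large inputs,
    # so several of these calls can execute concurrently
    return hashlib.sha256(chunk).hexdigest()

if __name__ == '__main__':
    chunks = [b'x' * (64 * 1024 * 1024)] * 16  # 16 buffers of 64 MiB

    start = time.time()
    sequential = [hash_chunk(c) for c in chunks]
    print('sequential:', time.time() - start)

    start = time.time()
    with ThreadPool(8) as p:
        threaded = p.map(hash_chunk, chunks)
    print('ThreadPool:', time.time() - start)

    assert sequential == threaded

On a multi-core machine the ThreadPool timing should come out noticeably lower, precisely because the hot loop runs with the GIL released.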
In short, yes, ThreadPool does what it says on the tin: it provides the same API as Pool, but backed by threads, not worker processes, and therefore it does not avoid the GIL's limitations and overhead. It wasn't even documented directly until recently; instead, it was indirectly documented by the multiprocessing.dummy docs, which were even more explicit about providing the multiprocessing API backed by threads, not processes (there you used it as multiprocessing.dummy.Pool, without the name actually including the word "Thread").
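If you want to verify that relationship yourself, a quick standard-library-only sketch shows that multiprocessing.dummy.Pool simply hands back a ThreadPool instance:

import multiprocessing.dummy
from multiprocessing.pool import ThreadPool

# multiprocessing.dummy.Pool is a factory function, not a class;
# the object it returns is a plain ThreadPool
with multiprocessing.dummy.Pool(4) as p:
    print(type(p))                    # <class 'multiprocessing.pool.ThreadPool'>
    print(isinstance(p, ThreadPool))  # True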
I'll note that your test makes Pool look better than it normally would. Usually, Pool does poorly with tasks like this (lots of data, little computation relative to the size of the data), because the cost of serializing the data and sending it to the child processes outweighs the minor gains from parallelizing the work. But since your "large data" was represented by range objects (which serialize cheaply, as a reference to the range class plus the arguments needed to reconstruct it), very little data is transferred to and from the workers. If you used real data (realized lists of ints), the benefits of Pool would drop dramatically. For example, just by changing the definition of job_list to:
job_list = [[*range(10000000)]] * 128
the time for Pool on my machine (which takes 3.11 seconds for your unmodified Pool case) jumps to 8.11 seconds. And even that's a lie, because the pickle serialization code recognizes the same list repeated over and over and serializes the inner list just once, then repeats it with a quick "see that first list" reference. I'd tell you what using:
job_list = [[*range(10000000)] for _ in range(128)]
did to the runtime, but I nearly crashed my machine just trying to run that line (it would require ~46 GB of memory to create said list of lists, and that cost would be paid once in the parent process, then again across the children); suffice it to say, Pool would lose quite badly, especially in cases where the data fits in RAM once, but not twice.
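Both effects are easy to observe directly with pickle; here's an illustrative sketch using a smaller range than the question's (exact byte counts will vary by Python version):

import pickle

n = 1_000_000
r = range(n)
lst = list(r)

# a range pickles as a tiny recipe (the class plus its constructor
# arguments), not as its million elements
print('range pickle:', len(pickle.dumps(r)), 'bytes')
print('list pickle: ', len(pickle.dumps(lst)), 'bytes')

# pickle's memo notices the *same* list object repeated, so [lst] * 4
# serializes the payload once plus tiny back-references...
shared = len(pickle.dumps([lst] * 4))
# ...while four distinct-but-equal lists pay the full cost four times
copies = len(pickle.dumps([list(r) for _ in range(4)]))
print('shared:', shared, 'copies:', copies)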