Reputation: 1180
I have an API which provides the client with a report. This report is generated from a large collection of unsorted data (hundreds of thousands of rows in 20 different tables). I have tried optimizing the code and got about a 40% improvement in timing, but it still takes up to 1 minute.
I was researching and came across this code example:
from multiprocessing import Process

def func1():
    print 'func1: starting'
    for i in xrange(10000000): pass
    print 'func1: finishing'

def func2():
    print 'func2: starting'
    for i in xrange(10000000): pass
    print 'func2: finishing'

if __name__ == '__main__':
    p1 = Process(target=func1)
    p1.start()
    p2 = Process(target=func2)
    p2.start()
    p1.join()
    p2.join()
And so I thought of maybe splitting the data in 2. The way I am generating the report is user by user (for user in users). So if I split the users array into N equal pieces and run the code in parallel, I will theoretically divide the processing time by N, right?
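A minimal sketch of what I have in mind (generate_report_chunk and the dummy users list are placeholders, not my actual report code):

import multiprocessing as mp

def generate_report_chunk(user_chunk):
    # placeholder: generate the report rows for one chunk of users
    for user in user_chunk:
        pass

if __name__ == '__main__':
    users = list(range(100))                   # stand-in for the real users array
    n = mp.cpu_count()                         # number of worker processes
    chunks = [users[i::n] for i in range(n)]   # split users into n roughly equal groups
    procs = [mp.Process(target=generate_report_chunk, args=(c,)) for c in chunks]
    for p in procs:
        p.start()
    for p in procs:
        p.join()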
Edit:
As promised, I am here to show you the results.
After my research, I concluded that the best pooling method for my case is starmap_async. Here is the code I used to implement my solution:
import multiprocessing as mp

pool = mp.Pool(mp.cpu_count())  # one worker process per available core
res = pool.starmap_async(generateReportRow, [(user, database, _client_id, _part_id, _range, state_counts, all_rows, days) for user in users]).get()
pool.close()
pool.join()
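For reference, starmap_async unpacks each tuple as positional arguments, so generateReportRow has a matching signature along these lines (the body here is just a placeholder, not my real implementation):

def generateReportRow(user, database, _client_id, _part_id, _range, state_counts, all_rows, days):
    # build and return the report row(s) for a single user;
    # .get() on the starmap_async result collects the return values in input order
    ...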
My server has 4 virtualized, shared CPU cores, which probably has an impact on the results, but they are still quite impressive.
On 3 different reports I observed clear performance improvements, up to 3x.
Report A had the largest data set and showed the highest performance improvement (the 3x case), which confirms @A_forsteri's statement in their answer.
Upvotes: 2
Views: 184
Reputation: 64
The multiprocessing module does provide true parallelism.
The creation of Process objects and inter-process communication may add some overhead, so given an original time T, your parallel processing time will be roughly (T/N) + x, where N is the number of processes and x is the overhead. The larger the data set, the more negligible x becomes.
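As a purely illustrative example (the numbers below are made up, not measurements):

T = 60.0            # original single-process time, in seconds
N = 4               # number of worker processes
x = 5.0             # assumed overhead for process creation and IPC
print((T / N) + x)  # -> 20.0 seconds, roughly a 3x speedup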
On a side note, I'd suggest looking into using Pools.
The map function does the chopping for you.
import multiprocessing as mp

def gen_report(user):
    # generate the report for a single user here
    pass

if __name__ == '__main__':
    users = [...]  # your list of users
    with mp.Pool() as report_generator:
        report_generator.map(gen_report, users)
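One tuning knob worth mentioning: Pool.map also accepts an optional chunksize argument, which controls how many items are handed to each worker at once and can reduce inter-process communication overhead when the per-item work is small (the value 50 below is purely illustrative):

report_generator.map(gen_report, users, chunksize=50)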
Upvotes: 4