Dimitar

Reputation: 1180

Will I get a noticeable performance improvement from parallel execution in Python?

I have an API that provides the client with a report. The report is generated from a large collection of unsorted data (hundreds of thousands of rows across 20 different tables). I have already optimized the code and gained about a 40% improvement in run time, but it still takes up to 1 minute.

I was researching and came across this code example:

from multiprocessing import Process

def func1():
    print('func1: starting')
    for i in range(10000000): pass
    print('func1: finishing')

def func2():
    print('func2: starting')
    for i in range(10000000): pass
    print('func2: finishing')

if __name__ == '__main__':
    p1 = Process(target=func1)
    p1.start()
    p2 = Process(target=func2)
    p2.start()
    p1.join()
    p2.join()

And so I thought of maybe splitting the data in two. I generate the report user by user (for user in users), so if I split the users list into N equal pieces and run the pieces in parallel, I should theoretically divide the processing time by N, right?
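
For reference, this is roughly what I have in mind (users is my existing list of users, and generate_reports is just a placeholder for the per-user work):

from multiprocessing import Process

def generate_reports(user_chunk):
    # placeholder worker: build the report rows for its share of users
    for user in user_chunk:
        pass  # real per-user report generation goes here

if __name__ == '__main__':
    N = 2  # number of parallel processes
    # split the users list into N roughly equal chunks by striding
    chunks = [users[i::N] for i in range(N)]
    processes = [Process(target=generate_reports, args=(chunk,)) for chunk in chunks]
    for p in processes:
        p.start()
    for p in processes:
        p.join()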

Edit:

As promised, I am back to show my results. After some research, I concluded that the best Pool method for my case is starmap_async. Here is the code I used for my solution:

import multiprocessing as mp

# one worker process per CPU core
pool = mp.Pool(mp.cpu_count())
# one argument tuple per user; .get() blocks until all rows are ready
res = pool.starmap_async(generateReportRow, [(user, database, _client_id, _part_id, _range, state_counts, all_rows, days) for user in users]).get()
pool.close()
pool.join()

My server has 4 virtualized, shared CPU cores, which probably affects the results below, but they are still fairly impressive.

On 3 different reports, I observed the following performance improvements:

  1. Report A: from 90 seconds down to 30 seconds (1/3 of the original time)
  2. Report B: from 150 seconds down to 80 seconds (roughly 1/2)
  3. Report C: from 40 seconds down to 20 seconds (1/2)

I also concluded that report A had the largest data set, which led to the biggest improvement (3x); this confirms @A_forsteri's statement in their answer.

Upvotes: 2

Views: 184

Answers (1)

A_forsteri

Reputation: 64

The multiprocessing module does provide true parallelism. Creating Process objects and the inter-process communication add some overhead, so given an original running time T, your parallelized time will be roughly (T/N) + x, where N is the number of processes and x is the overhead. The larger the data set, the more negligible x becomes.
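
A rough way to see this on your own machine is to time the same CPU-bound work serially and in two processes; the busy loop below is just placeholder work, and the exact numbers will vary with your hardware:

import time
from multiprocessing import Process

def busy():
    # placeholder CPU-bound work
    for _ in range(10000000):
        pass

if __name__ == '__main__':
    # serial run: T
    start = time.perf_counter()
    busy()
    busy()
    serial = time.perf_counter() - start

    # two processes: roughly T/2 + overhead x
    start = time.perf_counter()
    procs = [Process(target=busy) for _ in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    parallel = time.perf_counter() - start

    print('serial: %.2fs, parallel: %.2fs, overhead x: ~%.2fs' % (serial, parallel, parallel - serial / 2))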

On a side note, I'd suggest looking into using a Pool.

Pool.map does the chopping of the iterable into chunks for you.

import multiprocessing as mp

def gen_report(user):
    # generate the report for a single user here
    pass

if __name__ == '__main__':
    with mp.Pool() as report_generator:
        report_generator.map(gen_report, users)  # users: your list of users
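
If each call needs several arguments (as in your edit), starmap / starmap_async unpack argument tuples for you. A minimal sketch, assuming users and database already exist in your code:

import multiprocessing as mp

def gen_report(user, database):
    # generate the report for one user using the shared database argument
    pass

if __name__ == '__main__':
    tasks = [(user, database) for user in users]  # users and database come from your code
    with mp.Pool() as pool:
        pool.starmap(gen_report, tasks)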

Upvotes: 4
