asdfkjasdfjk

Reputation: 3894

Python: parallelize for loop for database query

I will break down this question into two parts.

I have code similar to this:

for data in data_list:
    # just a pseudo database query as an example; a single query usually takes around 30-50 seconds
    rslt = query in database where data == 10
    if rslt.property == 'some condition here':
        return rslt

The conditions here are:

  1. We have to return the first element of data_list that matches the condition after the query.
  2. Each database query takes around 30-40 seconds per element.
  3. data_list is usually very big, around 15-20k elements.
  4. Unfortunately, we cannot do a single database query for the whole data_list. We have to do this in a loop, one element at a time.

Now my questions are:

  1. How can I optimize this process? Currently the whole thing takes around 3-4 hours.
  2. I have read about Python threading and multiprocessing, but I am confused about which one would be appropriate in this case.

Upvotes: 0

Views: 778

Answers (1)

Tomerikoo

Reputation: 19414

You could consider using a multiprocessing Pool. You can then use map to send chunks of your iterable to the Pool's workers to process according to a given function. So, let's say your query is wrapped in a function, say, query(data):

def query(data):
    rslt = query in database where data == 10  # same pseudo query as in the question
    if rslt.property == 'some condition here':
        return rslt
    return None  # no match: return None explicitly
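
To make that concrete, here is a minimal sketch of what query could look like, assuming a SQLite lookup through the standard sqlite3 module; the file name example.db, the table items and its columns key and status are placeholders for whatever your real database client and schema are. The important part is only that the function returns the row on a match and None otherwise:

import sqlite3

def query(data):
    # Placeholder lookup: fetch the row whose "key" column equals data
    conn = sqlite3.connect("example.db")
    try:
        row = conn.execute("SELECT * FROM items WHERE key = ?", (data,)).fetchone()
    finally:
        conn.close()
    # Placeholder condition on the hypothetical "status" column
    if row is not None and row[1] == 'some condition here':
        return row
    return None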

We will use the pool like so:

from multiprocessing import Pool

with Pool() as pool:
    results = pool.map(query, data_list)

Now, to meet your requirement, we find the first match:

print(next(filter(None, results)))

Note that using the function query this way means that results will be a list of rslts and Nones, and we are looking for the first non-None result.
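
As a small illustration of that filtering, here is how next(filter(None, ...)) behaves on a toy list standing in for results (the string 'match' plays the role of a rslt that satisfied the condition):

results = [None, None, 'match', None, 'another match']  # toy stand-in for the mapped results
first = next(filter(None, results), None)  # first truthy entry, or None if nothing matched
print(first)  # prints: match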


A few Notes:

  1. Note that the Pool constructor's first argument is processes, which allows you to choose how many processes the pool will hold:

    If processes is None then the number returned by os.cpu_count() is used.

  2. Note that map also has a chunksize argument, which defaults to 1 and lets you choose the size of the chunks passed to the workers:

    This method chops the iterable into a number of chunks which it submits to the process pool as separate tasks. The (approximate) size of these chunks can be specified by setting chunksize to a positive integer.

  3. Continuing with map, the docs recommend using imap for very long iterables with an explicit chunksize for better efficiency:

    Note that it may cause high memory usage for very long iterables. Consider using imap() or imap_unordered() with explicit chunksize option for better efficiency.

    And from the imap docs:

    The chunksize argument is the same as the one used by the map() method. For very long iterables using a large value for chunksize can make the job complete much faster than using the default value of 1.

    So we could actually be more efficient and do:

    chunksize = 100
    processes = 10
    
    with Pool(processes=processes) as pool:
        print(next(filter(None, pool.imap(query, data_list, chunksize=chunksize))))
    

    Here you could play with chunksize and even processes (back from the Pool constructor) and see which combination produces the best results.

  4. If you are interested, you could easily switch to threads instead of processes by simply changing your import statement to the following (a combined sketch using this thread-based Pool follows after these notes):

    from multiprocessing.dummy import Pool
    

    As the docs say:

    multiprocessing.dummy replicates the API of multiprocessing but is no more than a wrapper around the threading module.
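
Putting notes 3 and 4 together, a minimal combined sketch could look like this; it assumes the query(data) function from above and your existing data_list, and uses the thread-based Pool, which is often a reasonable fit for I/O-bound work such as database queries:

from multiprocessing.dummy import Pool  # thread-backed Pool with the same API as multiprocessing.Pool

chunksize = 100  # tune this
processes = 10   # with the dummy Pool this is the number of worker threads

with Pool(processes=processes) as pool:
    # imap yields results lazily and in order, so next(filter(None, ...)) stops
    # consuming as soon as the first non-None (i.e. matching) result appears
    first_match = next(filter(None, pool.imap(query, data_list, chunksize=chunksize)), None)

print(first_match)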


Hope this helps in any way

Upvotes: 1
