Reputation: 3894
I will break down this question into two parts.
I have code similar to this:
for data in data_list:
    # just some pseudo database query to give an example,
    # but this single query usually takes around 30-50 seconds
    rslt = query in database where data == 10
    if rslt.property == 'some condition here':
        return rslt
The conditions here are: I need the first element of data_list that matches the condition after the query. data_list is usually very big, around 15-20k elements. We have to go through data_list in a loop, one element at a time.

Now my questions are,
Upvotes: 0
Views: 778
Reputation: 19414
You could consider using a multiprocessing Pool. You can then use map to send chunks of your iterable to the workers of the Pool to work on according to a given function. So, let's say your query is a function, say, query(data):
def query(data):
    rslt = query in database where data == 10  # the pseudo query from above
    if rslt.property == 'some condition here':
        return rslt
We will use the pool like so:
from multiprocessing import Pool

with Pool() as pool:
    results = pool.map(query, data_list)
Now, to your requirement, we will find the first one:
print(next(filter(None, results)))
Note that using the function query this way means that results will be a list of rslts and Nones, and we are looking for the first non-None result.
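As a minimal, runnable sketch of the whole pattern, with a hypothetical query whose body stands in for the real 30-50 second database call:

from multiprocessing import Pool

def query(data):
    rslt = data * 2  # hypothetical stand-in for the real database query
    if rslt == 10:  # stands in for rslt.property == 'some condition here'
        return rslt
    return None  # non-matching elements produce None

if __name__ == '__main__':  # required on platforms that spawn worker processes
    data_list = range(20000)
    with Pool() as pool:
        results = pool.map(query, data_list)
    print(next(filter(None, results)))  # prints 10, the first non-None result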
A few notes:

Pool's constructor's first argument is processes, which allows you to choose how many processes the pool will hold:

If processes is None then the number returned by os.cpu_count() is used.
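For instance (the cap of 4 here is arbitrary, purely to show the argument):

import os
from multiprocessing import Pool

if __name__ == '__main__':
    print(os.cpu_count())  # the default worker count when processes is None
    with Pool(processes=4) as pool:  # explicitly cap the pool at 4 workers
        print(pool.map(abs, [-1, -2, -3]))  # [1, 2, 3]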
map also has the chunksize argument, which defaults to 1 and allows you to choose the size of the chunks passed to the workers:
This method chops the iterable into a number of chunks which it submits to the process pool as separate tasks. The (approximate) size of these chunks can be specified by setting chunksize to a positive integer.
Continuing with map, the docs recommend using imap with an explicit chunksize for large iterables, for better efficiency:
Note that it may cause high memory usage for very long iterables. Consider using imap() or imap_unordered() with explicit chunksize option for better efficiency.
And from the imap docs:

The chunksize argument is the same as the one used by the map() method. For very long iterables using a large value for chunksize can make the job complete much faster than using the default value of 1.
So we could actually be more efficient and do the following (imap yields results lazily and in order, so next stops pulling results as soon as the first match is found, instead of waiting for the whole list to be computed):

chunksize = 100
processes = 10

with Pool(processes=processes) as pool:
    print(next(filter(None, pool.imap(query, data_list, chunksize=chunksize))))
And here you could play with chunksize and even processes (back from the Pool) and see what combination produces the best results.
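A rough sketch of that experiment, assuming a dummy query where time.sleep stands in for the real database call (all the numbers here are arbitrary):

import time
from multiprocessing import Pool

def query(data):
    time.sleep(0.001)  # dummy stand-in for the slow database query
    return data if data == 15000 else None  # hypothetical match condition

if __name__ == '__main__':
    data_list = range(20000)
    for processes in (4, 8, 16):
        for chunksize in (10, 100, 1000):
            start = time.perf_counter()
            with Pool(processes=processes) as pool:
                rslt = next(filter(None, pool.imap(query, data_list, chunksize=chunksize)))
            elapsed = time.perf_counter() - start
            print(f'processes={processes}, chunksize={chunksize}: {elapsed:.2f}s -> {rslt}')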
If you are interested, you could easily switch to threads instead of processes by simply changing your import statement to:
from multiprocessing.dummy import Pool
As the docs say:
multiprocessing.dummy replicates the API of multiprocessing but is no more than a wrapper around the threading module.
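Since the expensive part here is an I/O-bound database query (the GIL is typically released while a thread waits on I/O), threads can be a good fit for this workload. A small sketch, again with a hypothetical dummy query:

import time
from multiprocessing.dummy import Pool  # thread-based Pool with the same API

def query(data):
    time.sleep(0.001)  # dummy stand-in for the I/O-bound database query
    return data if data == 150 else None  # hypothetical match condition

with Pool(processes=32) as pool:  # threads are cheap, so a higher worker count is feasible
    print(next(filter(None, pool.imap(query, range(20000), chunksize=10))))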
Hope this helps in some way.
Upvotes: 1