Reputation: 11048
i'm facing a design problem within my project.
PROBLEM
i need to query solr with all the possible combinations (more or less 20 millions) of some parameters extracted from our lists, to test wether they give at least 1 result. in the case they don't, that combination is inserted into a blacklist (used for statistical analysis and sitemap creation)
HOW I'M DOING IT NOW
nested for loops to combine parameters (extracted from python lists) and pass them to a method (the same i use in production environment to query the db within the website) that tests for 0-results. if it's 0, there's a method inserting inside the blacklist
no threading involved
HOW I'D LIKE TO TO THIS
i'd like to put all the combinations inside a queue and let a thread object pull them, query and insert, for better performances
WHAT PROBLEMS I'M EXPERIENCING
slowliness: being single threaded, it now takes a lot to complete (when and if it completes)
connection reset by peer[104] : it's an error throwed by solr after a while it's been queried (i increased the pool size, but nothing changes) this is the most recurrent (and annoying) error, at the moment.
python hanging: this i resolved with a timeout decorator (which isn't a correct solution, but at least it helps me go throu the whole processing and have a quick test output for now. i'll drop this whenever i can come to a smart solution)
queue max size: a queue object can contain up to 32k elements, so it won't fit my numbers
WHAT I'M USING
python 2.7
mysql
apache-solr
sunburnt (python interface to solr)
linux box
I don't need any code debugging, since i'd rather throw away what i did for a fresh start, instead than patching it over and over and over... "Trial by error" is not what i like.
I'd like every suggestion that can come in mind to you to design this in the correct way. Also links, websites, guides are very much welcomed, since my experience with this kind of scripts is building as i work.
Thanks all in advance for your help! If you didn't understand something, just ask, i'll answer/update the post if needed!
EDIT BASED ON SOME ANSWERS (will keep this updated)
i'll probably drop python threads for the multiprocessing lib: this could solve my performance issues
divide-and-conquer based construction method: this should add some logic in my parameters construction, without needing any bruteforce approac
what i still need to know: where can i store my combinations to feed the worker thread? maybe this is no more an issue, since the divide-and-conquer approach may let me generate runtime the combinations and split them between the working threads.
NB: i wont' accept any answer for now, since i'd like to mantain this post alive for a while, just to gather more and more ideas (not only for me, but maybe for future reference of others, since it's generic nature)
Thanks all again!
Upvotes: 1
Views: 985
Reputation: 110561
You can use the stdlib "multiprocessing" module in order to have several subprocesses working with your combinations - This works better than Python's threads, and allow at least each logical CPU core in your configuration to run at the same time.
Here is a minimalist example of how it works:
import random
from multiprocessing import Pool
def a(a):
if random.randint(0, 100000) == 0:
return True
return False
# the number bellow should be a equal to your number of processor cores:
p = Pool(4)
x = any(p.map(a, xrange(1000000)))
print x
So, this makes a 10 million test, divided in 4 "worker" processes, with no scaling issues.
However, given the nature of the error messages you are getting, though you don't explicitly says so, you seem to be running an application with a web interface - and you wait for all the processing to finish before rendering a result to the browser. This tipically won't work with long running calculations - you'd better perform all your calculations in a separate process than the server process serving your web interface, and update the web interface via asynchronous requests, using a little javascript. That way you will avoid any "connection reset by peer" errors.
Upvotes: 1
Reputation: 13189
Instead of brute force, change to using a divide-and-conquer approach while keeping track of the number of hits for each search. If you subdivide into certain combinations, some of those sets will be empty so you eliminate many subtrees at once. Add missing parameters into remaining searches and repeat until you are done. It takes more bookkeeping but many fewer searches.
Upvotes: 3