Reputation: 40952
I want to achieve multithreading in Python, where the threaded function does some actions and adds a URL to a list of URLs (`links`), and a listener watches the `links` list from the calling script for new elements to iterate over. Confused? Me too; I'm not even sure how to go about explaining this, so let me try to demonstrate with pseudo-code:
from multiprocessing import Pool

def worker(links):
    # do lots of things with urllib2, including finding elements with BeautifulSoup,
    # extracting text from those elements and using it to compile the unique URL;
    # finally, append the URL that was gathered in the `lots of things` section to the list
    links.append('http://myUniqueURL.com')  # this will be unique for each time `worker` is called

links = []
for i in MyBigListOfJunk:
    Pool().apply(worker, (links,))
for link in links:
    # do a bunch of stuff with this link, including using it to retrieve the HTML source with urllib2
Now, rather than waiting for all the `worker` threads to finish and then iterating over `links` all at once, is there a way for me to iterate over the URLs as they are getting appended to the `links` list? Basically, the `worker` iteration that generates the `links` list HAS to be separate from the iteration over `links` itself; however, rather than running each sequentially, I was hoping I could run them somewhat concurrently and save some time... Currently I have to call `worker` upwards of 30-40 times within a loop, and the entire script takes roughly 20 minutes to finish executing...
Any thoughts would be very welcome, thank you.
Upvotes: 1
Views: 87
Reputation:
You should use the `Queue` class for this. It is a thread-safe FIFO queue: its `get` method removes an item from the queue and, importantly, blocks when there are no items, waiting until other processes add some.
If you use `multiprocessing`, then you should use the `Queue` from that module, not the `Queue` module.
Next time you ask a question about processes, please include the exact Python version you want it for. This answer is for 2.6.
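For illustration, here is a minimal sketch of what that might look like, assuming the `MyBigListOfJunk` iterable from your question is already defined and using a placeholder URL in place of the urllib2/BeautifulSoup work; it spawns one `Process` per item and consumes links from a `multiprocessing.Queue` as soon as each worker puts one on it:

from multiprocessing import Process, Queue

def worker(junk, queue):
    # ... do the urllib2/BeautifulSoup work with `junk` here ...
    url = 'http://myUniqueURL.com'  # placeholder for the URL you compile
    queue.put(url)                  # hand the URL to the consumer immediately

if __name__ == '__main__':
    queue = Queue()
    procs = [Process(target=worker, args=(junk, queue)) for junk in MyBigListOfJunk]
    for p in procs:
        p.start()

    # consume URLs as soon as they arrive, without waiting for every worker to finish
    for _ in range(len(procs)):
        link = queue.get()  # blocks until some worker has put a URL on the queue
        # do a bunch of stuff with `link`, e.g. fetch its HTML with urllib2

    for p in procs:
        p.join()

The blocking `get` is what lets the consuming loop start processing the first link while the other workers are still running.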
Upvotes: 1