Sasha Chedygov

Reputation: 130787

Multi-step, concurrent HTTP requests in Python

I need to do some three-step web scraping in Python. I have a couple of base pages that I scrape initially; I need to pull a few select links off those pages, retrieve the pages they point to, and then repeat that one more time. The trick is that I would like to do all of this asynchronously, so that every request is fired off as soon as possible and the whole application isn't blocked on a single request. How would I do this?

Up until this point, I've been doing one-step scraping with eventlet, like this:

import eventlet
from eventlet.green import urllib2

urls = ['http://example.com', '...']

def scrape_page(url):
    """Gets the data from the web page."""
    body = urllib2.urlopen(url).read()
    # Do something with body
    return data

pool = eventlet.GreenPool()
for data in pool.imap(scrape_page, urls):
    # Handle the data...
    pass

However, if I extend this technique with a nested GreenPool.imap loop, the inner loop blocks until all of the requests in that group are done, which means the application can't start more requests as it discovers them.
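
Roughly, the nested version I have in mind looks like the sketch below (extract_links is just a stand-in for whatever pulls the next-level links out of a page); the inner imap call is where everything stalls:

import eventlet
from eventlet.green import urllib2

pool = eventlet.GreenPool()

def fetch_links(url):
    """Fetch a page and return the links found on it."""
    body = urllib2.urlopen(url).read()
    return extract_links(body)  # stand-in for the real link extraction

base_urls = ['http://example.com', '...']
for links in pool.imap(fetch_links, base_urls):
    # This loop blocks until all of the requests in this group are done,
    # so deeper requests can't be fired off as soon as each page arrives.
    for data in pool.imap(fetch_links, links):
        # Handle the data (or extract third-level links)...
        pass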

I know I could do this with Twisted or another asynchronous server, but I don't need such a huge library and I would rather use something lightweight. I'm open to suggestions, though.

Upvotes: 0

Views: 1339

Answers (1)

jdi

Reputation: 92559

Here is an idea... but forgive me since I don't know eventlet. I can only provide a rough concept.

Consider your "step 1" pool the producers. Create a queue and have your step 1 workers place any new urls they find into the queue.

Create another pool of workers. Have these workers pull urls from the queue and process them. If, during processing, they discover another url, they put it into the queue. In this way the workers keep feeding themselves with subsequent work.

Technically this approach makes the scraping easily recursive beyond 1, 2, 3+ steps: as long as the workers find new urls and put them in the queue, the work keeps happening.

Better yet, start out with the original urls in the queue and just create a single pool whose workers put any new urls back into that same queue. Only one pool is needed.
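
A rough sketch of that single-pool version, assuming eventlet's GreenPool and Queue (parse_links below is just a placeholder for whatever link extraction you already do):

import eventlet
from eventlet.green import urllib2
from eventlet.queue import Queue

pool = eventlet.GreenPool()
queue = Queue()
seen = set()

def fetch(url):
    """Fetch a url and push any newly discovered urls back onto the queue."""
    body = urllib2.urlopen(url).read()
    for link in parse_links(body):  # placeholder: your own link extraction
        queue.put(link)
    # Handle/store the scraped data here...

# Seed the queue with the original urls.
for url in ['http://example.com', '...']:
    queue.put(url)

# Keep draining the queue and spawning workers; workers may add more urls
# while they run, so loop until the queue is empty and all workers are done.
while not queue.empty() or pool.running() != 0:
    while not queue.empty():
        url = queue.get()
        if url not in seen:
            seen.add(url)
            pool.spawn_n(fetch, url)
    pool.waitall()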

Post note

Funny enough, after I posted this answer and went to look for what the eventlet 'queue' equivalent was, I immediately found an example showing exactly what I just described:

http://eventlet.net/doc/examples.html#producer-consumer-web-crawler

In that example there is a producer and a fetch method. The producer pulls urls from the queue and spawns green threads to fetch them; fetch then puts any new urls back into the queue, and the two keep feeding each other.

Upvotes: 3
