RadiantHex

Reputation: 25557

Concurrent downloads - Python

The plan is this:

I download a webpage, collect the list of images found in the DOM, and then download them. Afterwards I iterate through the images to evaluate which one is best suited to represent the webpage.

The problem is that the images are downloaded one at a time, which can take quite some time.

It would be great if someone could point me in the right direction on this topic.

Help would be very much appreciated.

Upvotes: 9

Views: 6773

Answers (4)

Piotr Dobrogost

Reputation: 42425

Nowadays there are excellent Python libraries you might want to use: urllib3 and requests.

Upvotes: 0

rdw

Reputation: 211

Speeding up crawling is basically Eventlet's main use case. It's very fast: we have an application that has to hit 2,000,000 URLs in a few minutes. It uses the fastest event interface available on your system (generally epoll) and greenthreads (which are built on top of coroutines and are very inexpensive), which makes the code easy to write.

Here's an example from the docs:

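import eventlet
# Green (cooperative, non-blocking) version of urlopen.
# On Python 2 the equivalent import is: from eventlet.green import urllib2
from eventlet.green.urllib.request import urlopen

urls = ["http://www.google.com/intl/en_ALL/images/logo.gif",
        "https://wiki.secondlife.com/w/images/secondlife.jpg",
        "http://us.i1.yimg.com/us.yimg.com/i/ww/beta/y3.gif"]

def fetch(url):
    # Runs in a greenthread; while this one waits on the network, others run.
    body = urlopen(url).read()
    return url, body

pool = eventlet.GreenPool()
# imap runs the fetches concurrently and yields results in the order of urls.
for url, body in pool.imap(fetch, urls):
    print("got body from", url, "of length", len(body))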

This is a pretty good starting point for developing a more fully-featured crawler. Feel free to pop in to #eventlet on Freenode to ask for help.

[update: I added a more-complex recursive web crawler example to the docs. I swear it was in the works before this question was asked, but the question did finally inspire me to finish it. :)]

Upvotes: 13

Alex Martelli

Reputation: 881595

While threading is certainly a possibility, I would instead suggest asyncore -- there's an excellent example here which shows exactly the simultaneous fetching of two URLs (easy to generalize to any list of URLs!).
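As a rough sketch of the same single-threaded, event-driven idea: asyncore was deprecated in Python 3.6 and removed in 3.12, so the example below swaps in asyncio, its standard-library successor, rather than asyncore itself. The names `fetch` and `fetch_all` are made up for illustration, and the code assumes plain `http://` URLs served over a bare HTTP/1.0 GET (no TLS, no redirects, no status checking).

```python
import asyncio
from urllib.parse import urlsplit


async def fetch(url):
    """Fetch one plain-HTTP URL with a bare HTTP/1.0 GET; return (url, body)."""
    parts = urlsplit(url)
    reader, writer = await asyncio.open_connection(parts.hostname, parts.port or 80)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    request = f"GET {path} HTTP/1.0\r\nHost: {parts.hostname}\r\n\r\n"
    writer.write(request.encode("ascii"))
    await writer.drain()
    # With HTTP/1.0 the server closes the connection after the response,
    # so reading until EOF returns the whole reply.
    raw = await reader.read()
    writer.close()
    await writer.wait_closed()
    _headers, _sep, body = raw.partition(b"\r\n\r\n")
    return url, body


async def fetch_all(urls):
    # gather runs every fetch on one event loop and keeps the input order.
    return await asyncio.gather(*(fetch(url) for url in urls))
```

Calling `asyncio.run(fetch_all(urls))` returns a list of `(url, body)` pairs. For real pages you would most likely want an HTTP client library on top of the event loop instead of hand-written requests.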

Upvotes: 6

Matt Anderson

Reputation: 19769

Here is an article on threading which uses url fetching as an example.
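For a sense of what the threaded approach can look like, here is a minimal sketch using only the standard library. It is not taken from the article; `fetch`, `download_all`, and the `max_workers` default are illustrative names and values, and the code assumes each image fits comfortably in memory.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Download one URL and return its raw bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def download_all(urls, fetch=fetch, max_workers=8):
    """Download urls concurrently; return ({url: bytes}, {url: exception})."""
    bodies, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit every download up front, remembering which future is which URL.
        futures = {pool.submit(fetch, url): url for url in urls}
        # Collect results as they finish, not in submission order.
        for future in as_completed(futures):
            url = futures[future]
            try:
                bodies[url] = future.result()
            except Exception as exc:  # one failed image should not stop the rest
                errors[url] = exc
    return bodies, errors
```

After `bodies, errors = download_all(image_urls)`, the loop that picks the representative image can run over `bodies` as before. Threads suit this job because the work is I/O-bound: each thread spends most of its time waiting on the network.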

Upvotes: 4
