user776942

How do I customize this twisted code?

I am new to Python, and even newer to Twisted. I am trying to use Twisted to download a few hundred thousand files, but I am having trouble adding an errback. I'd like to print the bad URL if the download fails. I misspelled one of my URLs on purpose in order to trigger an error. However, the code just hangs and Python never finishes (it finishes fine if I remove the errback call).

Also, how do I process each file individually? From my understanding, "finish" is called when everything completes. I'd like to gzip each file as it's downloaded so that it can be removed from memory.

Here's what I have:

    from twisted.internet import defer, reactor
    from twisted.web import client

    urls = [
        'http://www.python.org',
        'http://stackfsdfsdfdsoverflow.com', # misspelled on purpose to generate an error
        'http://www.twistedmatrix.com',
        'http://www.google.com',
        'http://launchpad.net',
        'http://github.com',
        'http://bitbucket.org',
    ]

    def finish(results):
        for result in results:
            print 'GOT PAGE', len(result), 'bytes'
        reactor.stop()

    def print_badurls(err):
        print err # how do I just print the bad url????????

    waiting = [client.getPage(url) for url in urls]
    defer.gatherResults(waiting).addCallback(finish).addErrback(print_badurls)

    reactor.run()

Upvotes: 2

Views: 173

Answers (1)

Jean-Paul Calderone

Reputation: 48325

Welcome to Python and Twisted!

There are a few problems with the code you pasted. I'll go through them one at a time.

First, if you do want to download thousands of URLs, and will have thousands of items in the urls list, then this line:

    waiting = [client.getPage(url) for url in urls]

is going to cause problems. Do you want to try to download every page in the list simultaneously? By default, in general, things you do in Twisted happen concurrently, so this loop starts downloading every URL in the urls list at once. Most likely, this isn't going to work. Your DNS server is going to drop some of the domain lookup requests, your DNS client is going to drop some of the domain lookup responses. The TCP connection attempts to whatever addresses you do get back will compete for whatever network resources are still available, and some of them will time out. The rest of the connections will all trickle along, sharing available bandwidth between dozens or perhaps hundreds of different downloads.

Instead, you probably want to limit the degree of concurrency to perhaps 10 or 20 downloads at a time. I wrote about one approach to this on my blog a while back.
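
Here's a minimal sketch of that kind of limiter, along the lines of the approach described there, using twisted.internet.task.Cooperator. It assumes a download function that takes a URL and returns a Deferred; the parallel helper and its name are just for illustration:

    from twisted.internet import defer, task

    def parallel(iterable, count, callable, *args, **named):
        # Share one generator of work among `count` cooperative tasks, so
        # at most `count` downloads are in flight at any moment. Each task
        # pulls the next item; when it gets a Deferred, it waits for it
        # before pulling more work.
        coop = task.Cooperator()
        work = (callable(elem, *args, **named) for elem in iterable)
        return defer.DeferredList([coop.coiterate(work) for i in xrange(count)])

    # For example, run at most 10 downloads at a time:
    #     finished = parallel(urls, 10, download)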

Second, gatherResults returns a Deferred that fails as soon as any one of the Deferreds passed to it fails. So as soon as any one client.getPage(url) fails (perhaps because of one of the problems mentioned above, or because the domain has expired, or the web server happens to be down, or just because of an unfortunate transient network condition), the Deferred returned by gatherResults fails: finish is skipped and print_badurls is called with an error describing the single failed getPage call. And since print_badurls never calls reactor.stop(), the reactor keeps running, which is why your program hangs.

To handle failures from individual HTTP requests, add the callbacks and errbacks to the Deferreds returned from the getPage calls. After adding those callbacks and errbacks, you can use defer.gatherResults to wait for all of the downloads and processing of the download results to be complete.
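
Here's a sketch of that shape. The save_compressed, print_badurl, and download helpers are hypothetical names; the gzip step stands in for whatever per-file processing you want and addresses the "remove it from memory" part of your question:

    import gzip

    from twisted.internet import defer, reactor
    from twisted.web import client

    def save_compressed(page, url):
        # Gzip each page to disk as soon as it arrives, so page bodies
        # don't pile up in memory until everything is done.
        f = gzip.open(url.replace('/', '_') + '.gz', 'wb')
        f.write(page)
        f.close()
        return len(page)

    def print_badurl(err, url):
        # Extra arguments to addErrback are passed after the Failure,
        # so the failing URL is right here to print.
        print 'FAILED', url, ':', err.getErrorMessage()

    def download(url):
        d = client.getPage(url)
        d.addCallback(save_compressed, url)
        d.addErrback(print_badurl, url)
        return d

    def finish(results):
        reactor.stop()

    urls = [
        'http://www.python.org',
        'http://stackfsdfsdfdsoverflow.com', # still misspelled on purpose
    ]

    waiting = [download(url) for url in urls]
    # Each download handles its own failure, so gatherResults never
    # fails: finish always runs and the reactor always stops.
    defer.gatherResults(waiting).addCallback(finish)

    reactor.run()

Since print_badurl handles the failure without re-raising it, each failed download turns into a plain None result, which is why gatherResults can still fire its callback and the program exits cleanly.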

Third, you might want to consider using a higher-level tool for this: Scrapy is a web crawling framework (based on Twisted) that provides lots of useful helpers for exactly this kind of application.
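
For a taste of what that looks like, here's a minimal sketch of a Scrapy spider (using the scrapy.Spider API from recent Scrapy releases; the class name and URLs are illustrative):

    import scrapy

    class PageSpider(scrapy.Spider):
        # Scrapy handles scheduling, retries, and concurrency limits
        # (see the CONCURRENT_REQUESTS setting) for you.
        name = 'pages'
        start_urls = [
            'http://www.python.org',
            'http://www.twistedmatrix.com',
        ]

        def parse(self, response):
            # Called once per successfully downloaded page.
            self.logger.info('GOT PAGE %s: %d bytes',
                             response.url, len(response.body))

Save it as pages_spider.py and run it with scrapy runspider pages_spider.py.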

Upvotes: 2
