wlnirvana

Reputation: 2047

Scrapy: collect retry messages

There is a maximum number of times the crawler will retry a request, as documented here. After reaching that limit, I get an error similar to the following:

Gave up retrying <GET https:/foo/bar/123> (failed 3 times)

I believe the message is produced by the code here.

However, I want to do some post-processing on the requests that were given up on. Specifically, I wonder if it is possible to:

  1. Extract the 123 part (an ID) of the URL and write these IDs to a separate file cleanly.
  2. Access the meta info in the original request. This documentation might be helpful.

Upvotes: 3

Views: 1721

Answers (1)

paul trmbrth

Reputation: 20748

You can subclass scrapy.contrib.downloadermiddleware.retry.RetryMiddleware and override _retry() to do whatever you want with the request that is given up on:

from scrapy.contrib.downloadermiddleware.retry import RetryMiddleware
from scrapy import log

class CustomRetryMiddleware(RetryMiddleware):

    def _retry(self, request, reason, spider):
        retries = request.meta.get('retry_times', 0) + 1

        if retries <= self.max_retry_times:
            log.msg(format="Retrying %(request)s (failed %(retries)d times): %(reason)s",
                    level=log.DEBUG, spider=spider, request=request, retries=retries, reason=reason)
            retryreq = request.copy()
            retryreq.meta['retry_times'] = retries
            retryreq.dont_filter = True
            retryreq.priority = request.priority + self.priority_adjust
            return retryreq
        else:
            # do something with the request: inspect request.meta, look at request.url...
            log.msg(format="Gave up retrying %(request)s (failed %(retries)d times): %(reason)s",
                    level=log.DEBUG, spider=spider, request=request, retries=retries, reason=reason)

Then reference the custom middleware in your settings.py, disabling the built-in one:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': None,
    'myproject.middlewares.CustomRetryMiddleware': 500,
}
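
As for the "do something with the request" part, here is a minimal sketch of what the else branch of _retry() could call. It uses only the standard library and assumes the ID is always the last path segment of the URL. The helper names extract_id and record_give_up, the tab-separated output format, and the choice of retry_times as the meta value to record are all hypothetical, so adapt them to your project.

```python
from urllib.parse import urlparse


def extract_id(url):
    """Return the last non-empty path segment of a URL, or None if there is none."""
    path = urlparse(url).path
    segments = [s for s in path.split("/") if s]
    return segments[-1] if segments else None


def record_give_up(url, meta, out_path):
    """Append the ID of a given-up URL and one meta value to out_path.

    `meta` is expected to be a dict such as request.meta; only the
    'retry_times' key is written here, purely as an illustration.
    """
    item_id = extract_id(url)
    if item_id is None:
        return None
    # Append mode keeps IDs from earlier give-ups in the same file.
    with open(out_path, "a", encoding="utf-8") as f:
        f.write("%s\t%s\n" % (item_id, meta.get("retry_times", 0)))
    return item_id
```

Inside the middleware you would then call something like record_give_up(request.url, request.meta, 'given_up_ids.txt') right before logging the "Gave up retrying" message. Note that in Scrapy 1.0 and later the middleware lives in scrapy.downloadermiddlewares.retry, and scrapy.log has been replaced by the standard logging module.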

Upvotes: 6
