Reputation: 4171
I've used some proxies to crawl some websites. Here is what I did in settings.py:
# Retry many times since proxies often fail
RETRY_TIMES = 10
# Retry on most error codes since proxies fail for different reasons
RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]
DOWNLOAD_DELAY = 3  # 3,000 ms of delay
DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
    'myspider.comm.rotate_useragent.RotateUserAgentMiddleware': 100,
    'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': 200,
    'myspider.comm.random_proxy.RandomProxyMiddleware': 300,
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 400,
}
And I also have a proxy downloader middleware which has the following methods:
def process_request(self, request, spider):
    log('Requesting url %s with proxy %s...' % (request.url, proxy))

def process_response(self, request, response, spider):
    log('Response received from request url %s with proxy %s' % (request.url, proxy if proxy else 'nil'))

def process_exception(self, request, exception, spider):
    log_msg('Failed to request url %s with proxy %s with exception %s' % (request.url, proxy if proxy else 'nil', str(exception)))
    # retry again.
    return request
Since the proxies are not very stable, process_exception often logs a lot of request-failure messages. The problem is that the failed requests are never tried again.
As shown above, I've set RETRY_TIMES and RETRY_HTTP_CODES in the settings, and I also return the request for a retry in the process_exception method of the proxy middleware.
Why does Scrapy never retry the failed requests, and how can I make sure a request is tried at least the RETRY_TIMES I've set in settings.py?
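For illustration, here is a rough sketch of what I expected to be able to do in process_exception itself: count attempts in request.meta (the same 'retry_times' counter the built-in RetryMiddleware uses) and give up after RETRY_TIMES. The from_crawler wiring and the logger here are assumptions for the sketch, not my real code:

import logging

logger = logging.getLogger(__name__)

class RandomProxyMiddleware(object):
    def __init__(self, settings):
        self.max_retry_times = settings.getint('RETRY_TIMES')

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings)

    def process_exception(self, request, exception, spider):
        proxy = request.meta.get('proxy')
        logger.warning('Failed to request url %s with proxy %s: %s',
                       request.url, proxy if proxy else 'nil', exception)
        retries = request.meta.get('retry_times', 0) + 1
        if retries <= self.max_retry_times:
            # copy the request, bump the retry counter and bypass the
            # dupefilter so the scheduler accepts the same URL again
            retryreq = request.copy()
            retryreq.meta['retry_times'] = retries
            retryreq.dont_filter = True
            return retryreq
        # give up and let the remaining process_exception() methods run
        return None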
Upvotes: 9
Views: 7354
Reputation: 4171
Thanks for the help from @nyov on the Scrapy IRC channel.
'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': 200,
'myspider.comm.random_proxy.RandomProxyMiddleware': 300,
Here the Retry middleware gets run first, so it retries the request before it ever reaches the Proxy middleware. In my situation, Scrapy needs the proxies to crawl the website, or it will time out endlessly.
So I reversed the priority of these two downloader middlewares:
'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': 300,
'myspider.comm.random_proxy.RandomProxyMiddleware': 200,
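For completeness, a sketch of the full DOWNLOADER_MIDDLEWARES setting after the swap (module paths kept exactly as in my question; in Scrapy 1.0+ the scrapy.contrib.downloadermiddleware paths live under scrapy.downloadermiddlewares instead):

DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
    'myspider.comm.rotate_useragent.RotateUserAgentMiddleware': 100,
    # RandomProxyMiddleware now sits at a lower number than RetryMiddleware,
    # so its process_exception() is called after RetryMiddleware's
    # (process_request runs in ascending priority order, while
    # process_response/process_exception run in descending order).
    # RetryMiddleware therefore sees the exception first and counts the retry.
    'myspider.comm.random_proxy.RandomProxyMiddleware': 200,
    'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': 300,
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 400,
}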
Upvotes: 8
Reputation: 11396
It seems that your proxy downloader middleware's process_response() is not playing by the rules and is therefore breaking the middleware chain.
process_response() should either: return a Response object, return a Request object or raise an IgnoreRequest exception.
If it returns a Response (it could be the same given response, or a brand-new one), that response will continue to be processed with the process_response() of the next middleware in the chain.
...
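As a minimal sketch (assuming the proxy is stored in request.meta['proxy'], the key HttpProxyMiddleware reads), process_response() could log and then return the response so the rest of the chain, RetryMiddleware included, still gets to process it:

import logging

logger = logging.getLogger(__name__)

class RandomProxyMiddleware(object):
    def process_response(self, request, response, spider):
        proxy = request.meta.get('proxy')
        logger.debug('Response received from %s with proxy %s',
                     request.url, proxy if proxy else 'nil')
        # returning the response keeps the middleware chain intact, so
        # RetryMiddleware can still check it against RETRY_HTTP_CODES
        return response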
Upvotes: 0