Jausk

Reputation: 325

Unable to modify request in middleware using Scrapy

I am scraping public meteorology data for a data science project, and to do that effectively I need to change the proxy used by my Scrapy requests whenever one comes back with a 403 response code.

For this, I have defined a downloader middleware to handle that situation, which is as follows:

from scrapy import Request

class ProxyMiddleware(object):
    def process_response(self, request, response, spider):
        if response.status == 403:
            with open("Proxies.txt") as f:
                # random_line just returns a random line from the file
                # with a valid structure ("http://IP:port")
                proxy = random_line(f)
            new_request = Request(url=request.url)
            new_request.meta['proxy'] = proxy
            spider.logger.info("[Response 403] Changed proxy to %s" % proxy)
            return new_request
        return response

After properly adding the class to settings.py, I expected this middleware to handle 403 responses by generating a new request with the new proxy, eventually ending in a 200 response. The observed behaviour is that the middleware does get executed (I can see the logger info about the changed proxy), but the new request does not seem to be made. Instead, I'm getting this:

2018-12-26 23:33:19 [bot_2] INFO: [Response] Changed proxy to https://154.65.93.126:53281
2018-12-26 23:33:26 [bot_2] INFO: [Response] Changed proxy to https://176.196.84.138:51336

... indefinitely with random proxies, which makes me think that I'm still receiving 403 errors and the proxy is not actually changing.

The documentation on process_response states:

(...) If it returns a Request object, the middleware chain is halted and the returned request is rescheduled to be downloaded in the future. This is the same behavior as if a request is returned from process_request().

Is it possible that "in the future" does not mean "right after it is returned"? What should I do to change the proxy for all requests from that moment on?

Upvotes: 0

Views: 1315

Answers (2)

Janib Soomro

Reputation: 632

Try this:

from scrapy import Request

class ProxyMiddleware(object):
    def process_response(self, request, response, spider):
        if response.status == 403:
            with open("Proxies.txt") as f:
                proxy = random_line(f)
            new_request = Request(url=request.url)
            new_request.meta['proxy'] = proxy
            spider.logger.info("[Response 403] Changed proxy to %s" % proxy)
            return new_request
        else:
            return response

A better approach would be to use the scrapy-rotating-proxies package instead. In settings.py:

DOWNLOADER_MIDDLEWARES = {
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}
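The package also needs to be told where the proxies come from. Assuming the same Proxies.txt file from the question (one proxy per line), that could look like this in settings.py:

# settings.py -- point scrapy-rotating-proxies at the proxy list.
# ROTATING_PROXY_LIST_PATH expects a file with one proxy per line;
# alternatively, ROTATING_PROXY_LIST accepts a Python list of proxy
# URLs directly.
ROTATING_PROXY_LIST_PATH = 'Proxies.txt'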

Upvotes: 0

VMRuiz

Reputation: 1981

Scrapy drops duplicate requests to the same URL by default, so that's probably what's happening in your spider. To check whether this is your case, you can set these settings:

DUPEFILTER_DEBUG=True
LOG_LEVEL='DEBUG'

To solve this you should add dont_filter=True:

new_request = Request(url=request.url, dont_filter=True)
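Putting it together, a minimal sketch of the middleware with the filter disabled (still relying on the question's random_line helper) might look like this:

from scrapy import Request

class ProxyMiddleware(object):
    def process_response(self, request, response, spider):
        if response.status == 403:
            with open("Proxies.txt") as f:
                proxy = random_line(f)
            # dont_filter=True lets the rescheduled request through the
            # duplicates filter even though the URL is unchanged
            new_request = Request(url=request.url, dont_filter=True)
            new_request.meta['proxy'] = proxy
            spider.logger.info("[Response 403] Changed proxy to %s" % proxy)
            return new_request
        return response

Alternatively, request.replace(dont_filter=True) returns a copy of the original request with the flag set, which also preserves its headers, callback and the rest of request.meta.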

Upvotes: 2
