Bociek

Reputation: 1265

Scrapy redirect is always 200

I am experiencing strange behavior in Scrapy. I collect status codes by calling response.status, but not all of them are present (the 3xx codes seem to be missing). I see the following in the log:

downloader/response_status_count/200: 8150
downloader/response_status_count/301: 226
downloader/response_status_count/302: 67
downloader/response_status_count/303: 1
downloader/response_status_count/307: 48
downloader/response_status_count/400: 7
downloader/response_status_count/403: 44
downloader/response_status_count/404: 238
downloader/response_status_count/405: 8
downloader/response_status_count/406: 26
downloader/response_status_count/410: 7
downloader/response_status_count/500: 12
downloader/response_status_count/502: 6
downloader/response_status_count/503: 3

whereas my CSV file has only 200, 404, 403, 406, 502, 400, 405, 410, 500, 503. I set HTTPERROR_ALLOW_ALL = True in settings.py. Can I force Scrapy to provide information about redirects? Right now I am taking it from response.meta['redirect_times'] and response.meta['redirect_urls'], but the status code is still 200 instead of 3xx.
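
For illustration, a minimal sketch of that workaround (the item field names are just examples):

    def parse(self, response):
        yield {
            'url': response.url,
            'status': response.status,  # 200 even when redirects happened
            # redirect_times / redirect_urls are set by RedirectMiddleware
            'redirect_times': response.meta.get('redirect_times', 0),
            'redirect_urls': response.meta.get('redirect_urls', []),
        }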

Upvotes: 1

Views: 264

Answers (1)

Granitosaurus

Reputation: 21446

3xx responses will never reach your callback (your parse method) because they are handled by the redirect middleware before that.
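
As an aside, if you need the 3xx responses themselves to reach your callback, you can switch off redirect handling per request with Scrapy's documented dont_redirect and handle_httpstatus_list request.meta keys; a minimal sketch with a placeholder URL and spider name:

    import scrapy

    class RedirectAwareSpider(scrapy.Spider):
        name = 'redirect_aware'  # hypothetical spider

        def start_requests(self):
            # dont_redirect stops RedirectMiddleware from following 3xx;
            # handle_httpstatus_list lets those responses reach parse()
            yield scrapy.Request(
                'http://example.com',
                meta={'dont_redirect': True,
                      'handle_httpstatus_list': [301, 302, 303, 307, 308]},
            )

        def parse(self, response):
            # response.status can now be 3xx; the Location header
            # holds the redirect target
            self.logger.info('%s %s', response.status,
                             response.headers.get('Location'))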

However, all of the response statuses are already stored in Scrapy's stats, as you have pointed out yourself, which means you can easily pull them in your crawler at any point:

  1. In your callback:

    def parse(self, response):
        stats = self.crawler.stats.get_stats()
        # keep only the response status counters
        status_stats = {
            k: v for k, v in stats.items()
            if 'status_count' in k
        }
        # e.g. {'downloader/response_status_count/200': 1}
    
  2. In your pipeline (see docs for how to use pipelines):

    import json

    class SaveStatsPipeline:
        """Save response status stats in a stats.json file"""

        def close_spider(self, spider):
            """When the spider closes, save all status stats to stats.json"""
            stats = spider.crawler.stats.get_stats()
            status_stats = {
                k: v for k, v in stats.items()
                if 'status_count' in k
            }
            with open('stats.json', 'w') as f:
                f.write(json.dumps(status_stats))
    

Anywhere you have access to the crawler object, really!
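
For instance, a minimal sketch of an extension (class name hypothetical) that grabs the crawler via Scrapy's from_crawler hook and logs the same stats when the spider closes:

    import json

    from scrapy import signals

    class StatusStatsExtension:
        """Hypothetical extension: log response status stats on close."""

        def __init__(self, crawler):
            self.crawler = crawler

        @classmethod
        def from_crawler(cls, crawler):
            ext = cls(crawler)
            crawler.signals.connect(ext.spider_closed,
                                    signal=signals.spider_closed)
            return ext

        def spider_closed(self, spider):
            stats = self.crawler.stats.get_stats()
            status_stats = {k: v for k, v in stats.items()
                            if 'status_count' in k}
            spider.logger.info(json.dumps(status_stats))

You would enable it through the EXTENSIONS setting in settings.py.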

Upvotes: 2
