Reputation: 1265
I am experiencing strange behavior in Scrapy. I collect status codes by calling response.status, but not all of them are present (the 3xx codes seem to be missing). I see the following in the log:
downloader/response_status_count/200: 8150
downloader/response_status_count/301: 226
downloader/response_status_count/302: 67
downloader/response_status_count/303: 1
downloader/response_status_count/307: 48
downloader/response_status_count/400: 7
downloader/response_status_count/403: 44
downloader/response_status_count/404: 238
downloader/response_status_count/405: 8
downloader/response_status_count/406: 26
downloader/response_status_count/410: 7
downloader/response_status_count/500: 12
downloader/response_status_count/502: 6
downloader/response_status_count/503: 3
whereas my csv file only has 200, 404, 403, 406, 502, 400, 405, 410, 500 and 503. I set HTTPERROR_ALLOW_ALL = True in settings.py. Can I force Scrapy to provide information about redirects? Right now I am taking it from response.meta['redirect_times'] and response.meta['redirect_urls'], but the status code is still 200 instead of 3xx.
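For reference, this is how I read the redirect info in my callback right now (a minimal sketch; the two meta keys are the ones set by Scrapy's redirect middleware):
    def parse(self, response):
        # After a redirect chain, response.status is the final status (200);
        # the redirect middleware only records the hops in response.meta.
        redirect_urls = response.meta.get('redirect_urls', [])
        redirect_times = response.meta.get('redirect_times', 0)
        self.logger.info('%d redirects via %r, final status %d',
                         redirect_times, redirect_urls, response.status)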
Upvotes: 1
Views: 264
Reputation: 21446
3xx responses will never reach your callback (parse method) because they are handled by the redirect middleware before that.
However, all of the response statuses are already stored in the Scrapy stats, as you have pointed out yourself, which means you can easily pull them in your crawler at any point.
In your callback:
    def parse(self, response):
        stats = self.crawler.stats.get_stats()
        status_stats = {
            k: v for k, v in stats.items()
            if 'status_count' in k
        }
        # {'downloader/response_status_count/200': 1}
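If you only need a single counter, the stats collector also exposes get_value, so you can read one key directly (a minimal sketch; the key matches the log output above):
    count_301 = self.crawler.stats.get_value(
        'downloader/response_status_count/301', 0)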
In your pipeline (see docs for how to use pipelines):
    import json

    class SaveStatsPipeline:
        """Save response status stats in a stats.json file."""

        def close_spider(self, spider):
            """When the spider closes, save all status stats to stats.json."""
            stats = spider.crawler.stats.get_stats()
            status_stats = {
                k: v for k, v in stats.items()
                if 'status_count' in k
            }
            with open('stats.json', 'w') as f:
                f.write(json.dumps(status_stats))
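Don't forget to enable the pipeline in settings.py; the module path below ('myproject.pipelines') is just a placeholder for wherever the class lives in your project:
    ITEM_PIPELINES = {
        'myproject.pipelines.SaveStatsPipeline': 300,
    }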
Anywhere you have access to the crawler object, really!
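For instance, here is a sketch of a small extension that hooks the spider_closed signal (the class name is hypothetical, and it would still need to be enabled via the EXTENSIONS setting):
    from scrapy import signals

    class StatusStatsExtension:
        """Hypothetical extension: log status counts when the spider closes."""

        def __init__(self, stats):
            self.stats = stats

        @classmethod
        def from_crawler(cls, crawler):
            ext = cls(crawler.stats)
            crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
            return ext

        def spider_closed(self, spider):
            status_stats = {
                k: v for k, v in self.stats.get_stats().items()
                if 'status_count' in k
            }
            spider.logger.info('Status stats: %s', status_stats)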
Upvotes: 2