Reputation: 137
I know I can check by myself whether a robots.txt file exists, using Python and firing an HTTP(S) request. But since Scrapy already checks and downloads it in order to have a spider respect the rules in it, is there a property, method, or anything else in the Spider class that lets me know whether robots.txt exists for the website being crawled?
I tried with crawler stats. Scrapy's RobotsTxtMiddleware increments this counter for every robots.txt response it receives:

    self.crawler.stats.inc_value(f'robotstxt/response_status_count/{response.status}')
I did a couple of tests against websites with and without robots.txt, and the stats reflected its existence correctly. For example, logging self.crawler.stats.__dict__ from the spider_closed signal handler in my Spider class (sketched below), I see:
    'robotstxt/response_status_count/200': 1    # website with robots.txt
    'robotstxt/response_status_count/404': 1    # website without robots.txt
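For reference, this is roughly the handler I used. It is a minimal sketch: the spider name and start URL are placeholders, and get_stats() returns the same data I was reading from stats.__dict__:

    import scrapy
    from scrapy import signals

    class MySpider(scrapy.Spider):
        name = 'my_spider'                      # placeholder name
        start_urls = ['https://example.com']    # placeholder URL

        @classmethod
        def from_crawler(cls, crawler, *args, **kwargs):
            spider = super().from_crawler(crawler, *args, **kwargs)
            # Connect the handler to the spider_closed signal
            crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
            return spider

        def spider_closed(self, spider):
            # The robotstxt/* counters appear here once the crawl has finished
            self.logger.info(self.crawler.stats.get_stats())

        def parse(self, response):
            pass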
However, this does not work when the spider crawls several domains in one run; the stats then end up looking something like:

    "robotstxt/response_status_count/200": 1,
    "robotstxt/response_status_count/301": 6,
    "robotstxt/response_status_count/404": 9,
    "robotstxt/response_status_count/403": 1

and I can't map those HTTP status codes back to individual domains...
Upvotes: 1
Views: 682
Reputation: 3720
I don't think so; you would probably have to write a custom downloader middleware based on RobotsTxtMiddleware. It has the methods _parse_robots and _robots_error, which you could use (or override) to determine whether robots.txt exists for each domain.
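Something along these lines could work. This is only a sketch: the stat key robotstxt/exists/<netloc>, the class name, and the module path are my own choices, and the signatures of these private methods shown here match recent Scrapy versions, so check them against the version you are running:

    from scrapy.downloadermiddlewares.robotstxt import RobotsTxtMiddleware

    class RobotsTxtExistenceMiddleware(RobotsTxtMiddleware):
        """Record, per domain, whether a robots.txt was actually served."""

        def _parse_robots(self, response, netloc, spider):
            # A 200 response means the site serves a robots.txt;
            # 404/403/etc. mean there is no usable robots.txt for this netloc.
            self.crawler.stats.set_value(
                f'robotstxt/exists/{netloc}', response.status == 200)
            return super()._parse_robots(response, netloc, spider)

        def _robots_error(self, failure, netloc):
            # The robots.txt request itself failed (DNS error, timeout, ...),
            # so treat the file as missing for this netloc.
            self.crawler.stats.set_value(f'robotstxt/exists/{netloc}', False)
            return super()._robots_error(failure, netloc)

Enable it in settings.py in place of the built-in middleware (100 is the default priority of RobotsTxtMiddleware; the myproject.middlewares path is a placeholder), and the per-domain flags will show up in the crawl stats next to the response_status_count entries:

    DOWNLOADER_MIDDLEWARES = {
        'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': None,
        'myproject.middlewares.RobotsTxtExistenceMiddleware': 100,
    }
    ROBOTSTXT_OBEY = True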
Upvotes: 1