Reputation: 137
I know I can check by myself whether a robots.txt file exists, using Python and firing an HTTP(S) request. But since Scrapy already checks and downloads it in order to have a spider respect the rules in it, is there a property, method, or anything else in the Spider class that lets me know whether robots.txt exists for the website being crawled?
I tried with crawler stats. Scrapy's RobotsTxtMiddleware increments this counter for every robots.txt response it receives:

    self.crawler.stats.inc_value(f'robotstxt/response_status_count/{response.status}')
I did a couple of tests against websites with and without robots.txt, and the stats reflected its existence correctly. For example, logging self.crawler.stats.__dict__ from the spider_closed signal handler in my Spider class (sketched below), I see:
    'robotstxt/response_status_count/200': 1    # website with robots.txt
    'robotstxt/response_status_count/404': 1    # website without robots.txt
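For reference, this is roughly the handler I used. It is a minimal sketch: the spider name and start URL are placeholders, and get_stats() returns the same data I was reading from stats.__dict__:

    import scrapy
    from scrapy import signals

    class MySpider(scrapy.Spider):
        name = 'my_spider'                      # placeholder name
        start_urls = ['https://example.com']    # placeholder URL

        @classmethod
        def from_crawler(cls, crawler, *args, **kwargs):
            spider = super().from_crawler(crawler, *args, **kwargs)
            # Connect the handler to the spider_closed signal
            crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
            return spider

        def spider_closed(self, spider):
            # The robotstxt/* counters appear here once the crawl has finished
            self.logger.info(self.crawler.stats.get_stats())

        def parse(self, response):
            pass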
However, this does not work when the spider crawls several domains in one run; the stats then end up looking something like:

    "robotstxt/response_status_count/200": 1,
    "robotstxt/response_status_count/301": 6,
    "robotstxt/response_status_count/404": 9,
    "robotstxt/response_status_count/403": 1

and I can't map those HTTP status codes back to individual domains...
Upvotes: 1
Views: 682
Reputation: 3720
I don't think so; you would probably have to write a custom downloader middleware based on RobotsTxtMiddleware. It has the methods _parse_robots and _robots_error, which you could use (or override) to determine whether robots.txt exists for each domain.
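Something along these lines could work. This is only a sketch: the stat key robotstxt/exists/<netloc>, the class name, and the module path are my own choices, and the signatures of these private methods shown here match recent Scrapy versions, so check them against the version you are running:

    from scrapy.downloadermiddlewares.robotstxt import RobotsTxtMiddleware

    class RobotsTxtExistenceMiddleware(RobotsTxtMiddleware):
        """Record, per domain, whether a robots.txt was actually served."""

        def _parse_robots(self, response, netloc, spider):
            # A 200 response means the site serves a robots.txt;
            # 404/403/etc. mean there is no usable robots.txt for this netloc.
            self.crawler.stats.set_value(
                f'robotstxt/exists/{netloc}', response.status == 200)
            return super()._parse_robots(response, netloc, spider)

        def _robots_error(self, failure, netloc):
            # The robots.txt request itself failed (DNS error, timeout, ...),
            # so treat the file as missing for this netloc.
            self.crawler.stats.set_value(f'robotstxt/exists/{netloc}', False)
            return super()._robots_error(failure, netloc)

Enable it in settings.py in place of the built-in middleware (100 is the default priority of RobotsTxtMiddleware; the myproject.middlewares path is a placeholder), and the per-domain flags will show up in the crawl stats next to the response_status_count entries:

    DOWNLOADER_MIDDLEWARES = {
        'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': None,
        'myproject.middlewares.RobotsTxtExistenceMiddleware': 100,
    }
    ROBOTSTXT_OBEY = True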
Upvotes: 1