Reputation: 147
I discovered yesterday that Scrapy respects the robots.txt file by default (ROBOTSTXT_OBEY = True).
If I request a URL with scrapy shell url and get a response, does that mean the url is not protected by robots.txt?
Upvotes: 3
Views: 2597
Reputation: 675
According to the docs, it's enabled by default only when you create a project using the scrapy startproject command; otherwise it defaults to False.
https://docs.scrapy.org/en/latest/topics/settings.html#robotstxt-obey https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#topics-dlmw-robots
To answer your question: yes, the scrapy shell command does respect the robots.txt configuration defined in settings.py. If ROBOTSTXT_OBEY = True, trying to use scrapy shell on a URL disallowed by robots.txt will yield a response of None.
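If you want to check whether a particular path is blocked without launching Scrapy at all, you can parse a robots.txt by hand with Python's standard-library urllib.robotparser. This is only a rough approximation of what Scrapy's RobotsTxtMiddleware does (Scrapy uses its own parser under the hood), and the robots.txt content below is made up for illustration:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# can_fetch(user_agent, url) returns True if the URL is allowed
print(rp.can_fetch("*", "https://example.com/private/page"))  # False
print(rp.can_fetch("*", "https://example.com/public/page"))   # True
```

For a live site you would call rp.set_url("https://example.com/robots.txt") followed by rp.read() instead of parsing a string.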
You can also test it by passing the robots.txt setting via the command line:
scrapy shell https://www.netflix.com --set="ROBOTSTXT_OBEY=True"
Upvotes: 3