M. Coppee

Reputation: 147

Scrapy and respect of robots.txt

I discovered yesterday that Scrapy respects the robots.txt file by default (ROBOTSTXT_OBEY = True).

If I request a URL with scrapy shell url and get a response, does that mean the URL is not protected by robots.txt?

Upvotes: 3

Views: 2597

Answers (1)

Marcos

Reputation: 675

According to the docs, ROBOTSTXT_OBEY is enabled by default only when you create a project with the scrapy startproject command; otherwise it defaults to False.

https://docs.scrapy.org/en/latest/topics/settings.html#robotstxt-obey
https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#topics-dlmw-robots

To answer your question: yes, the scrapy shell command does respect the robots.txt configuration defined in settings.py. With ROBOTSTXT_OBEY = True, running scrapy shell against a URL disallowed by robots.txt yields a response of None.

You can also test it passing robots.txt settings via command line:

scrapy shell https://www.netflix.com --set="ROBOTSTXT_OBEY=True"
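Independently of Scrapy, you can also check whether a specific URL is disallowed using Python's standard-library urllib.robotparser, which implements the same robots.txt matching logic. A minimal sketch (the rules and URLs below are hypothetical, just to illustrate how can_fetch works):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, parsed locally so no network access is needed.
# For a real site you would use parser.set_url(".../robots.txt") and parser.read().
rules = """
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# can_fetch(user_agent, url) returns whether that agent may crawl the URL.
print(parser.can_fetch("*", "https://example.com/private/page"))  # False
print(parser.can_fetch("*", "https://example.com/public/page"))   # True
```

This is the same check Scrapy's RobotsTxtMiddleware performs before each request when ROBOTSTXT_OBEY is enabled, so it is a quick way to verify why a particular URL is being filtered.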

Upvotes: 3
