Reputation: 147
I discovered yesterday that Scrapy respects the robots.txt file by default (ROBOTSTXT_OBEY = True).
If I request a URL with scrapy shell url and get a response, does that mean the url is not protected by robots.txt?
Upvotes: 3
Views: 2597
Reputation: 675
According to the docs, it's enabled by default only when you create a project using the scrapy startproject command; otherwise it defaults to False.
https://docs.scrapy.org/en/latest/topics/settings.html#robotstxt-obey https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#topics-dlmw-robots
To answer your question: yes, the scrapy shell command does respect the robots.txt configuration defined in settings.py. If ROBOTSTXT_OBEY = True, trying to use scrapy shell on a URL disallowed by robots.txt will yield a response of None.
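If you want to check whether a particular path is blocked without launching Scrapy at all, you can parse a robots.txt by hand with Python's standard-library urllib.robotparser. This is only a rough approximation of what Scrapy's RobotsTxtMiddleware does (Scrapy uses its own parser under the hood), and the robots.txt content below is made up for illustration:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# can_fetch(user_agent, url) returns True if the URL is allowed
print(rp.can_fetch("*", "https://example.com/private/page"))  # False
print(rp.can_fetch("*", "https://example.com/public/page"))   # True
```

For a live site you would call rp.set_url("https://example.com/robots.txt") followed by rp.read() instead of parsing a string.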
You can also test it by passing the robots.txt setting via the command line:
scrapy shell https://www.netflix.com --set="ROBOTSTXT_OBEY=True"
Upvotes: 3