Reputation: 3500
I want to write a crawler that respects robots.txt. Unfortunately, it seems that headless browsers don't support robots.txt. I raised a discussion with the PhantomJS people and got this answer: PhantomJS is a browser, not a crawler; if you use it from a script, the script is responsible for respecting robots.txt.
Is this correct? I was thinking that robots.txt must be respected for each HTTP request, not just the main URLs.
So the question: does it suffice to check robots.txt only for the main URL?
Upvotes: 0
Views: 1425
Reputation: 133995
No, it doesn't suffice to check robots.txt only for the main URL. For example, a site might allow bots to crawl its HTML pages but prevent them from accessing images.
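As an illustration, a hypothetical robots.txt along these lines permits crawling in general but blocks everything under an /images/ directory (the path is made up for the example):

    User-agent: *
    Disallow: /images/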
That is something of a problem, isn't it? As I understand it, if you ask PhantomJS to visit a web page, it's going to download not only the page content but also any referenced scripts, images, stylesheets, and so on. So your script may have verified that it's okay to crawl the main URL, but it can't know what other URLs the page references.
I would suggest that you look at the PhantomJS API to see if it has a hook that lets you filter the URLs it requests. That is, before PhantomJS tries to download an image, for example, it would call the filter to see if it's okay. I don't know offhand whether such a function exists, but if it does, that's where you could check robots.txt.
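For what it's worth, PhantomJS does document a page.onResourceRequested callback, and in builds from 1.9 onward the second argument is supposed to offer an abort() method. If that holds for your version, a sketch along these lines could block disallowed requests; the disallowedPrefixes list and the naive prefix match stand in for a real robots.txt parser and are assumptions for illustration only.

    // Minimal sketch, assuming PhantomJS 1.9+ where networkRequest.abort() is available.
    var page = require('webpage').create();

    // Hypothetical path prefixes parsed from the site's robots.txt ahead of time.
    var disallowedPrefixes = ['/images/', '/private/'];

    function isAllowed(url) {
        // Strip scheme and host, then compare the path against each Disallow prefix.
        // A real crawler should use a proper robots.txt parser and honour per-agent rules.
        var path = url.replace(/^https?:\/\/[^\/]+/, '');
        for (var i = 0; i < disallowedPrefixes.length; i++) {
            if (path.indexOf(disallowedPrefixes[i]) === 0) {
                return false;
            }
        }
        return true;
    }

    page.onResourceRequested = function (requestData, networkRequest) {
        if (!isAllowed(requestData.url)) {
            console.log('Skipping disallowed resource: ' + requestData.url);
            networkRequest.abort();
        }
    };

    page.open('http://example.com/', function (status) {
        console.log('Page load: ' + status);
        phantom.exit();
    });

If abort() turns out not to be available in your build, the same callback can at least log the offending URLs so your script knows the page pulled in something robots.txt forbids.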
Absent a way for your script to filter the URLs that PhantomJS requests, I would suggest finding something else to base your crawler on. Doing otherwise puts you at risk of your crawler accessing files that robots.txt explicitly prohibits to crawlers.
Upvotes: 3