Reputation: 19
I'm maintaining the website http://www.totalworkflow.co.uk and I'm not sure whether HTTrack follows the instructions given in the robots.txt file. If there is a way to keep HTTrack away from the website, please suggest how to implement it, or just tell me the robot name so I can block this crawler from my website. If this is not possible with robots.txt, please recommend another way to keep this robot away from the site.
You are right, there is no requirement for spam crawlers to follow the guidelines in robots.txt; I know that robots.txt is only honoured by genuine search engines. However, HTTrack may behave properly if its developers have hard-coded it not to skip the robots.txt guidelines when they are provided, and if that is the case the application would be genuinely useful for its intended purpose. Coming back to my issue: what I would really like is a way to keep the HTTrack crawlers away without hard-coding anything on the web server. I want to try solving this at the webmaster level first. Still, your idea is worth considering in the future. Thank you.
Upvotes: 1
Views: 3934
Reputation: 11
This can be done in two ways:
robots.txt
User-agent: HTTrack
Disallow: /
.htaccess
RewriteEngine On
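# Return 403 Forbidden to any client whose User-Agent starts with "HTTrack"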
RewriteCond %{HTTP_USER_AGENT} ^HTTrack
RewriteRule ^.* - [F,L]
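As a quick sanity check (not part of the original answer), here is a minimal Python sketch that requests a page twice, once with an ordinary User-Agent and once with one beginning with "HTTrack", and prints the status codes. The URL is a placeholder, the requests package is assumed to be installed, and the "HTTrack..." string is simply a value that matches the ^HTTrack rule above; real HTTrack builds may send a different default User-Agent.

# Sanity-check sketch: expect HTTP 200 for a normal agent and 403 for one
# matching the ^HTTrack rule above. Placeholder URL; requires `requests`.
import requests

URL = "http://www.example.com/"  # replace with your own site

for agent in ("Mozilla/5.0 (compatible; test)", "HTTrack Website Copier"):
    resp = requests.get(URL, headers={"User-Agent": agent}, timeout=10)
    print(agent, "->", resp.status_code)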
Upvotes: 1
Reputation: 4016
It should obey robots.txt, but robots.txt is something a crawler does not have to obey (and for spam bots it is actually a handy map of what you don't want other people to see), so even if HTTrack obeys it now, what's the guarantee that some future version won't offer an option to ignore all robots.txt rules and meta tags?

A better approach is to configure your server-side application to detect and block user agents. There is a chance that the user-agent string is hardcoded somewhere in the crawler's source code, in which case the user can't change it to get around your block. All you have to do is write a server script that reports user-agent information (or check your server logs) and then create blocking rules from that information; a rough sketch of such a log check follows the link below. Alternatively, you can just google a list of known "bad agents". To block user agents on a server that supports .htaccess, have a look at this thread for one way of doing it:
Block by useragent or empty referer
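For illustration only (this is my sketch, not code from the linked thread), a short Python script can do the "check server logs" part: it scans an Apache combined-format access log, counts requests per User-Agent, and prints the most frequent ones so you can decide which to block. The log path and the combined log format are assumptions; adjust both for your server.

# Sketch: count User-Agent strings in an Apache combined-format access log.
# The path is an assumption; change it to match your server's configuration.
import re
from collections import Counter

LOG_PATH = "/var/log/apache2/access.log"

# In the combined format the User-Agent is the last quoted field on the line.
UA_AT_END = re.compile(r'"([^"]*)"\s*$')

counts = Counter()
with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        match = UA_AT_END.search(line)
        if match:
            counts[match.group(1)] += 1

# Unfamiliar or obviously automated agents in this list are candidates for
# blocking rules like the .htaccess example in the other answer.
for agent, hits in counts.most_common(20):
    print(hits, agent)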
Upvotes: 1