Reputation: 183
We have a problem with a number of our sites where Yahoo, Google, Yandex, Bing, Ahrefs and others all crawl the site at the same time, which kills the website.
I've configured fail2ban to block the source IPs, but these are forever changing, so it's not ideal. I have also tried using robots.txt, but it makes little difference.
We have tried putting the site behind cloudflare, but again this makes little difference and all we can do is block the source IPs.
What else can I do?
Currently we're monitoring the site with Nagios, which restarts nginx when the site becomes unresponsive, but this seems far from ideal.
Ubuntu server running nginx
The robots.txt file is as follows:
User-agent: *
Disallow: /
Posting here in case there is anything that I can get our developers to try.
Thanks
Upvotes: 1
Views: 2426
Reputation: 13221
An easy approach is to rate limit them based on the User-Agent header in their requests. Schematically this looks like the following.
At the http level in the Nginx configuration:
# Flag bot requests by User-Agent; anything else maps to an empty string,
# and requests with an empty key are not rate limited.
map $http_user_agent $bot_ua {
    default            '';
    "~*Googlebot|Bing" Y;
}

# One shared 1 MB zone for all flagged bots, limited to 1 request per second.
limit_req_zone $bot_ua zone=bot:1m rate=1r/s;
This will make sure that all requests with Googlebot or Bing in the User-Agent header are rate limited to 1 request per second. Note that the rate limiting will be "global" (vs. per-IP), i.e. all of the bots will wait in a single queue to access the web site. The configuration can easily be modified to rate limit on a per-IP basis or to whitelist some user agents, as sketched below.
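For instance, a minimal sketch of the per-IP variant with a whitelist entry (the zone name bot_per_ip, the 10m size and the DuckDuckBot entry are illustrative choices, not part of the answer above):

map $http_user_agent $bot_ip {
    default            '';                     # normal traffic: not limited
    "~*DuckDuckBot"    '';                     # example whitelist entry: not limited
    "~*Googlebot|Bing" $binary_remote_addr;    # each bot IP gets its own counter
}

limit_req_zone $bot_ip zone=bot_per_ip:10m rate=1r/s;

The limit_req directive below would then reference zone=bot_per_ip instead of zone=bot.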
At the server or location level:
limit_req zone=bot burst=5;
This means a "burst" of 5 requests above the rate is possible. You may drop this option if you want.
By default Nginx issues HTTP status code 503 when a request is rate limited; you can change this to 429 ("Too Many Requests") with the limit_req_status directive. "Sane" web crawlers detect this and slow down scanning the site.
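Putting it together, a minimal sketch of the server side (the server_name and the proxied backend are placeholders, not taken from the question):

server {
    listen 80;
    server_name example.com;              # placeholder

    location / {
        limit_req zone=bot burst=5;       # apply the shared bot zone
        limit_req_status 429;             # return 429 instead of the default 503
        proxy_pass http://127.0.0.1:8080; # placeholder backend
    }
}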
Though I should say this whole problem is much more complicated. There are a lot of malicious requests pretending to be from Google, Twitter, FB, etc. coming from various scanners and crawlers (e.g. see this question), and they respect neither robots.txt nor 429. Sometimes they are quite smart and have User-Agent strings mimicking browsers. In that case the approach above will not help you.
Upvotes: 2
Reputation: 103
Your robots.txt should have worked, though note that not all crawlers respect robots.txt.
The robots.txt filename is case-sensitive (it must be all lowercase), and the file needs to be publicly readable at www.yourdomain.com/robots.txt.
See what happens when you add Crawl-delay: 10 to your robots.txt.
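A minimal sketch of the file with that directive added (whether to keep the blanket Disallow: / from the question is a separate decision; also note that support for Crawl-delay varies, e.g. Bing and Yandex honour it while Googlebot ignores it):

User-agent: *
Crawl-delay: 10
Disallow: /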
Upvotes: 0