user2099762

Reputation: 183

Web crawlers overloading site

We have a problem with a number of our sites where Yahoo, Google, Yandex, Bing, Ahrefs and others all index the site at the same time, which kills the website.

I've configured fail2ban to block the source IPs, but these are forever changing, so it's not ideal. I have also tried using robots.txt, but this makes little difference.

We have tried putting the site behind Cloudflare, but again this makes little difference and all we can do is block the source IPs.

What else can I do?

Currently we're monitoring the site with Nagios, which restarts nginx when the site becomes unresponsive, but this seems far from ideal.

Ubuntu server running nginx

The robots.txt file is here:

User-agent: *
Disallow: /

Posting here in case there is anything that I can get our developers to try.

Thanks

Upvotes: 1

Views: 2426

Answers (2)

Alexander Azarov

Reputation: 13221

An easy approach is to rate limit them based on the User-Agent header in their requests. Schematically, this looks like the following.

At the http level in the Nginx configuration:

map $http_user_agent $bot_ua {
  default '';

  "~*Googlebot|Bing" Y;
}

limit_req_zone $bot_ua zone=bot:1m rate=1r/s;

This will make sure that all requests with Googlebot or Bing in the User-Agent are rate limited to 1 request per second. Note that the rate limiting will be "global" (vs. per-IP), i.e. all of the bots will wait in a single queue to access the web site. The configuration can easily be modified to rate limit on a per-IP basis or to whitelist some user agents.
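For example, a per-IP variant might look like the following (a sketch, not part of the original answer; the zone name bot_per_ip and the exact bot patterns are illustrative). Using the client address as the map value means non-bot traffic still gets an empty key and is never limited:

map $http_user_agent $bot_ua {
  default '';

  # an empty value whitelists this (hypothetical) crawler entirely
  "~*SomeTrustedBot" '';

  # every other matched bot gets its own per-IP bucket
  "~*Googlebot|Bing|Yandex|AhrefsBot" $binary_remote_addr;
}

limit_req_zone $bot_ua zone=bot_per_ip:10m rate=1r/s;

Regex entries in a map block are checked in the order they appear, so the whitelist line has to come before the broader pattern.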

At the server or location level:

limit_req zone=bot burst=5;

This means a "burst" of 5 requests is possible. You may drop this option if you want.

By default Nginx responds with HTTP status 503 when a request is rate limited; you can change this to 429 ("Too Many Requests") with the limit_req_status directive. "Sane" web crawlers detect this and slow down scanning the site.
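Putting it together at the server level, a sketch might look like this (example.com and the upstream address are placeholders; limit_req_status is only needed if you prefer 429 over the default 503):

server {
  listen 80;
  server_name example.com;

  location / {
    # throttle requests whose User-Agent matched the bot map, allowing
    # a short burst of 5 requests before further ones are delayed
    limit_req zone=bot burst=5;

    # respond with 429 Too Many Requests instead of the default 503
    limit_req_status 429;

    proxy_pass http://127.0.0.1:8080;  # placeholder upstream / app server
  }
}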


Though I should say this whole problem is much more complicated. There are a lot of malicious requests pretending to be from Google, Twitter, FB, etc., coming from various scanners and crawlers (e.g. see this question), and they respect neither robots.txt nor 429. Sometimes they are quite smart and have User-Agents mimicking browsers. In this case the approach above will not help you.

Upvotes: 2

Mike Waters

Reputation: 103

Your robots.txt should have worked, though note that not all crawlers respect robots.txt.

robots.txt is case-sensitive, and it needs to be world-readable at www.yourdomain.com/robots.txt.

See what happens when you add Crawl-delay: 10.
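For reference, adding the directive to the existing file from the question would look like this (Crawl-delay is honoured by some crawlers such as Bing, but Googlebot ignores it; its crawl rate is controlled from Search Console):

User-agent: *
Disallow: /
Crawl-delay: 10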

Upvotes: 0
