Alexander

Reputation: 494

Nginx to cache only for bots

I have a decent-sized website (nginx -> apache -> mod_php/mysql) that I need to tune a bit, and I find the biggest problem is that search bots tend to overload it by sending many requests at once.

There is a cache in the site's core (that is, in PHP), so the site's author claims there should be no problem, but in practice the bottleneck is that Apache's response time gets too long because there are too many simultaneous requests for the page.

What I have in mind is an nginx-based cache that caches pages only for bots. The TTL can be fairly high (nothing on the page is so dynamic that it can't wait another 5-10 minutes to be refreshed). Let's define a 'bot' as any client that has 'Bot' in its UA string ('BingBot', for example).

So I tried something like this:

map $http_user_agent $isCache {
    default                0;
    ~*(google|bing|msnbot) 1;
}

proxy_cache_path /path/to/cache levels=1:2 keys_zone=my_cache:10m max_size=10g inactive=60m use_temp_path=off;

server {
    ...
    location / {
        proxy_cache my_cache;
        proxy_cache_bypass $isCache;
        proxy_cache_min_uses 3;
        proxy_cache_use_stale error timeout updating http_500 http_502 http_503 http_504;
        proxy_cache_lock on;
        proxy_pass http://my_upstream;
    }
    # location for images goes here
}

Is my approach right? It looks like it won't work.

Are there any other approaches to limit the load from bots? Preferably without sending them 5xx codes (as search engines can lower rankings for sites that return too many 5xx responses).

Thank you!

Upvotes: 2

Views: 1031

Answers (1)

Troy Morehouse

Reputation: 5435

If your content pages can differ per user (say a user is logged in and the page contains "welcome John Doe"), then that personalized version of the page may end up in the cache: bypassing the cache only skips serving from it, while each request still updates the cached copy. In other words, a logged-in person would overwrite the cached version, potentially including their session cookies, which is bad.

It is best to do something similar to the following:

map $http_user_agent $isNotBot {
  ~*bot    "";
  default  "IAmNotARobot";
}

server {
  ...
  location / {
    ...
    # Bypass the cache for humans
    proxy_cache_bypass $isNotBot;
    # Don't cache copies of requests from humans
    proxy_no_cache     $isNotBot;
    ...
  }
  ...
}

This way, only requests made by a bot are cached for future bot requests, and only bots are served cached pages.
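
For completeness, here is a rough sketch of how this map might be combined with the cache settings from your question (the proxy_cache_path parameters, the my_upstream name, and the 10-minute proxy_cache_valid TTL are taken from or assumed based on your question, not tested in your setup):

map $http_user_agent $isNotBot {
  ~*bot    "";
  default  "IAmNotARobot";
}

proxy_cache_path /path/to/cache levels=1:2 keys_zone=my_cache:10m max_size=10g inactive=60m use_temp_path=off;

server {
  ...
  location / {
    proxy_cache           my_cache;

    # Serve from the cache only when the client is a bot
    proxy_cache_bypass    $isNotBot;

    # Store in the cache only when the client is a bot,
    # so logged-in users never populate it
    proxy_no_cache        $isNotBot;

    # You said 5-10 minutes of staleness is acceptable (assumption: 10m)
    proxy_cache_valid     200 301 302 10m;

    proxy_cache_use_stale error timeout updating http_500 http_502 http_503 http_504;
    proxy_cache_lock      on;

    proxy_pass http://my_upstream;
  }
  ...
}

The proxy_cache_min_uses directive from your original config could be kept as well; it simply delays caching a URL until it has been requested a few times.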

Upvotes: 1
