Reputation: 3682
In trying to measure and increase our nginx throughput, I noticed that there might be a problem with our configuration, but I'm not sure how to test for it.
We use a simple upstream config, somewhat like this:
upstream myapp1 {
    server srv1.example.com max_fails=1 fail_timeout=3s;
    server srv2.example.com max_fails=1 fail_timeout=3s;
    server srv3.example.com max_fails=1 fail_timeout=3s;
}
When our backends become overloaded, the first upstream may be marked unavailable, and the redistributed load may quickly cause the other backends to fail as well, leaving no available backends for the duration of the fail_timeout window.
How does nginx behave in this situation? How does it treat the incoming client connections? What errors can I expect to see in the nginx logs?
From OS / netstat monitoring, it appears that nginx holds these incoming connections until one or more backends returns to an available state, at which point .... I'm not sure. Are all waiting connections dumped onto the first available backend, likely overloading it again and repeating the cycle of failure?
What is the correct behaviour in a situation like this? Can (should?) nginx be configured to simply drop / 503 any incoming connections when no backends are available?
Update: upon further research, it appears that nginx will decide whether a backend is available or not based on various settings. Ignoring these settings, is there some way to observe nginx's decision? A log entry perhaps? Anything to confirm what is going on under the hood?
Upvotes: 1
Views: 2097
Reputation: 27258
It sounds like you may have deeper architectural issues than simply the nginx front-end.
It is, of course, important to monitor the performance of your front-end server and how it deals with the backends; however, the best idea is to architect your infrastructure in such a way as to avoid the overloads via the front-end in the first place.
The normal reason for a failed upstream scenario is a reboot of the system or failed physical infrastructure, not a slashdot traffic spike that brings one of your upstreams to its knees and subsequently causes a domino effect across the rest of the upstreams as well.
(TBH, if it's the nominal peak load that could take one of your upstreams down, it's unclear what makes you think that the others could possibly remain online regardless of which combination of them nginx sends the leftover clients to, provided that all of them have roughly equal capacity.)
As such, when designing the architecture, you need to ensure that you have enough upstream servers that any one of them going down will not cause overload conditions for the remaining ones. This means each one has to have a reasonable amount of reserve capacity and, if applicable, handle errors gracefully itself, too.
Additionally, it's always a good idea to implement failsafes at the front-end to start with: nginx offers http://nginx.org/r/limit_conn and http://nginx.org/r/limit_req, which are there to ensure that an overload condition can be detected at the root. You can combine this with http://nginx.org/r/error_page to catch the errors (possibly using http://nginx.org/r/recursive_error_pages and/or http://nginx.org/r/proxy_intercept_errors, as applicable) and, depending on circumstances, provide either cached versions of your pages (see http://nginx.org/r/proxy_cache) or appropriate error messages. There's really no limit to the amount of logic you can put into nginx even using only the standard syntax and directives; it's possible, for example, to detect and handle the slashdot effect directly from within nginx in a completely microservice-like architecture.
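As a rough illustration, here is a minimal sketch of such a front-end failsafe. The zone names, sizes, rates, and the /overloaded.html page are assumptions for the example, not recommendations:

http {
    # Track request rate per client IP: 10 MB shared zone, 10 requests/second.
    limit_req_zone $binary_remote_addr zone=perip_req:10m rate=10r/s;
    # Track concurrent connections per client IP.
    limit_conn_zone $binary_remote_addr zone=perip_conn:10m;

    server {
        location / {
            # Absorb short bursts, reject the excess early (503 by default)
            # instead of passing the overload on to the upstreams.
            limit_req zone=perip_req burst=20 nodelay;
            limit_conn perip_conn 10;

            # Show a static page for rejected or failed requests.
            error_page 503 /overloaded.html;

            proxy_pass http://myapp1;
        }
    }
}

The point is that the rejection happens at nginx itself, before the backends ever see the excess requests.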
As for nginx, it's been tried-and-true in the most demanding and mission-critical applications — http://nginx.org/r/upstream is pretty clear on how the server selection takes place:
By default, requests are distributed between the servers using a weighted round-robin balancing method. … If an error occurs during communication with a server, the request will be passed to the next server, and so on until all of the functioning servers will be tried. If a successful response could not be obtained from any of the servers, the client will receive the result of the communication with the last server.
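If you'd rather bound that retry behaviour than let a request walk through every server, the proxy_next_upstream family of directives controls it. A sketch, with assumed values:

location / {
    proxy_pass http://myapp1;
    # Which failures cause nginx to try the next server.
    proxy_next_upstream error timeout http_502 http_503 http_504;
    # Stop after two attempts or five seconds, whichever comes first,
    # rather than cascading the request across all upstreams.
    proxy_next_upstream_tries 2;
    proxy_next_upstream_timeout 5s;
}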
I'd be surprised if these conditions aren't logged in http://nginx.org/r/error_log, depending on the level of logging that you specify. If you have a very big installation, you might also want to look into commercial monitoring solutions, like NGINX Amplify.
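To the "how do I observe this" part of the question: raising the error_log verbosity should be enough, since, as far as I recall, nginx reports upstream state changes at the warn level with messages along the lines of "upstream server temporarily disabled":

# Log warnings and above; upstream health decisions show up as warnings.
error_log /var/log/nginx/error.log warn;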
Upvotes: 2
Reputation: 27005
There is no "correct" behavior for this situation; it depends more on how you would like to handle/manage the load and also on your setup.
Keep in mind that error_page handles errors that are generated by Nginx itself; therefore, if you would like to take action based on the status codes returned by your upstreams, you will need proxy_intercept_errors, for example:
location / {
    proxy_pass http://myapp1;
    proxy_http_version 1.1;
    proxy_redirect off;
    proxy_set_header Host $http_host;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_intercept_errors on;
    error_page 500 502 503 504 =200 /50x.html;
}
In this case the line:

error_page 500 502 503 504 =200 /50x.html;

will return a status code 200 and display the content of the /50x.html page when your upstreams return 500, 502, 503, or 504.
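For completeness, the /50x.html URI also needs a location that serves it. A minimal sketch, with an assumed root path:

location = /50x.html {
    # Adjust to wherever your static error page actually lives.
    root /usr/share/nginx/html;
    # Only reachable via internal redirects such as error_page.
    internal;
}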
Upvotes: 1