John Carrell

Reputation: 2000

Nginx reverse proxying to Django receiving `upstream prematurely closed connection while reading response header from upstream`

TL;DR

What HTTP/TCP phenomenon is occurring when Nginx logs `upstream prematurely closed connection while reading response header from upstream` while reverse proxying over HTTP to a local Django instance (with no WSGI middleware)?

Long Version

At the risk of infuriating the community, I'm not going to include any config; while I'm sure it's relevant, I'm trying to understand the theory behind this phenomenon.

Some teammates and I maintain a webserver for internal use. In our world of internal tools, things are never productized. We typically do whatever is necessary to deliver some value to our co-workers. Stakes and available resources are low.

As such we've committed a cardinal sin in standing up a Python 2 Django server on its own. No WSGI middleware, no additional processes. I've seen the admonitions but we've done what we've done.

I recently stood up an Nginx instance in front of this abomination to give us the ability to "hot-swap" instances of our web application with zero downtime. I still did not insert anything in between. Nginx simply reverse-proxies, over HTTP on localhost, to the Django instance listening on a non-standard port.

After this change we started seeing bursts of 502s from Nginx. There are a few pages that are "live" in that they do some polling to check for updates to things. As such, there is "a lot" of traffic for the number of users we have.

I actually think the problem already existed before the introduction of Nginx, but since the browser got the error directly, it simply retried and the hiccup was invisible to the user; now users get an ugly 502 error message instead.

Now for the question: if I see `upstream prematurely closed connection while reading response header from upstream` in the Nginx error.log, what does that actually mean? I've seen lots of threads on this site with suggestions for config changes, none of which seem to apply to me, but what I'm looking for is the theory.

What does that error mean? What exactly is Nginx experiencing when it tries to proxy a request to Django? Is Django refusing connections? Is Django closing connections before they are finished?

If Django is doing these things, why? Is it out of memory or threads? Is there some reason it would have a limit on the number of threads, etc.?

As a blind attempt at a temporary, over-the-weekend fix I stood up a second instance of the application and configured Nginx to round-robin load balance to them. It seems to have worked but I won't be sure until Monday morning when peak load ensues.

The second instance was on the same box so there can't be any additional system resources. Is there some resource in a Python interpreter instance that is running out such that creating a second instance gives me "twice" the capacity?

I'm really trying to learn something valuable here beyond "throw more resources at it!"

Any help would be appreciated. Thanks, in advance!

UPDATE

Philipp, thank you so much for your thorough answer! One quick question to lock in my understanding...

If my upstream Python server "cannot handle enough requests in parallel and blocks", what could be the cause of this? It is a single process, so that sort of simplifies the issue, I would think. What resource would be running out? Isn't the server likely just reading off of a socket at whatever speed it can accommodate? What system/server configuration would dictate the number of in-flight requests it can handle at a time? I looked pretty thoroughly and couldn't find any explicit Django (the Python server library) config options that would artificially limit its responsiveness. I can certainly stand up additional resources, but if it's more of a system limitation then I wouldn't expect another instance on the same box to do anything (which is what I'm now expecting, as a second instance began producing the same problem over the weekend). I'd like to make a calculated decision here once and for all.

Thank you, again for your (or anyone else's) help!

UPDATE 2

The underlying issue (as described to me by a Linux kernel-savvy co-worker once I got in on Monday morning) is the LISTEN QUEUE DEPTH.

It is this construct whose capacity is limited. When a process listens on a port, incoming connections that the process has not yet accepted sit in the LISTEN QUEUE; that queue builds up if the process is accepting connections more slowly than they are coming in, and once it is full the kernel starts refusing or dropping new connection attempts.

So, it's not about memory or CPU (unless a shortage of these resources is the reason for the slow connection establishment) but a constraint on a process's capacity for connections.

I am by no means an expert on any of this, but this construct was the answer I was after as to why a given process suddenly decides (or the OS decides for it) that it will accept no more connections.
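In concrete terms, a minimal sketch of the construct (an illustration, not our actual server): the backlog argument to listen() is the depth of that queue, and the kernel additionally caps it (on Linux, by net.core.somaxconn). The Python stdlib socketserver default backlog is 5, and Django's development server is built on top of it, so the queue stays small unless you raise it.

```python
# Illustration of the LISTEN QUEUE: a slow, single-threaded server whose
# backlog fills up under concurrent connection attempts.  Port 8001 is
# just a stand-in for the demo.
import socket
import time

BACKLOG = 5   # depth of the listen queue requested from the kernel

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(("127.0.0.1", 8001))
server.listen(BACKLOG)             # <-- the construct described above

while True:
    conn, _addr = server.accept()  # if this loop is slow, the queue grows;
    time.sleep(1)                  # once full, new attempts are refused/dropped
    conn.sendall(b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok")
    conn.close()
```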

More can be read here.

Thanks, again, to Philipp for leading me down the correct path!

Upvotes: 4

Views: 9135

Answers (2)

Philipp Claßen

Reputation: 44009

`upstream prematurely closed connection while reading response header from upstream`

That error is definitely about the upstream, which in your case means the connection to your Python server. A 502 indicates that a TCP connection from Nginx to one of its upstream servers was closed (either actively closed by the Python process, or closed by the system because it timed out).

From what you describe, it could be that the Python server cannot handle enough requests in parallel and blocks. As long as you did not have Nginx in front, you would not notice it, except perhaps that requests were slow. With Nginx in front that changes, as Nginx can easily handle lots of requests and might accept more requests than its upstream server (i.e., your Python server) can keep up with. In that situation, the upstream server does not respond, and eventually the socket gets closed, which forces Nginx to fail with 502 (Bad Gateway).

To test the theory, compare what happens when you make multiple parallel requests either to Nginx or directly to the Python server. If requests get queued and served more slowly (but without errors) when you go directly to the Python server, but are all accepted immediately (with some failing with 502) when you go through Nginx, it could be the situation that I described.
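A rough way to run that comparison, sketched in Python 3; the paths and ports here (Nginx on 80, Django on 8001) are placeholders for whatever your setup actually uses:

```python
# Fire a burst of concurrent requests at each endpoint and tally the
# outcomes (2xx, 502, connection errors).  Hypothetical URLs/ports.
from concurrent.futures import ThreadPoolExecutor
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def probe(url):
    try:
        with urlopen(url, timeout=10) as resp:
            return resp.status
    except HTTPError as exc:      # e.g. a 502 returned by Nginx
        return exc.code
    except URLError as exc:       # refused/reset, e.g. when hitting Django directly
        return "conn-error: %s" % exc.reason


def burst(url, n=50):
    with ThreadPoolExecutor(max_workers=n) as pool:
        results = list(pool.map(probe, [url] * n))
    return {outcome: results.count(outcome) for outcome in set(results)}


print("via nginx:", burst("http://127.0.0.1/some/polled/page"))
print("direct   :", burst("http://127.0.0.1:8001/some/polled/page"))
```

If the direct run is slow but clean while the Nginx run is fast but sprinkled with 502s, that points to the upstream falling behind rather than to an Nginx misconfiguration.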

In that case, you can try a few things:

  • Make sure that keep-alive works on Nginx (that is a good idea anyway, and should limit the number of parallel requests to upstream). For details, see this answer.
  • (If possible) change the Python server so it can handle more parallel requests (one option is sketched after this list)
  • Make sure that you are not running out of file handles on your server, and monitor the number of TCP sockets on the system (e.g., with sudo netstat -tulpan).
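For the second point, a minimal sketch of what "handle more parallel requests" could look like without bringing in extra middleware: wrap the WSGI application in a threading server. This is Python 3 (under Python 2 the mixin lives in the SocketServer module), and `application` here is just a placeholder for your Django WSGI application object.

```python
# Sketch only: serve a WSGI app with one thread per request instead of
# serially, and with a deeper listen backlog than the stdlib default.
from socketserver import ThreadingMixIn
from wsgiref.simple_server import WSGIServer, make_server


class ThreadingWSGIServer(ThreadingMixIn, WSGIServer):
    """Handle each request in its own thread."""
    daemon_threads = True
    request_queue_size = 64   # listen() backlog; the stdlib default is 5


def application(environ, start_response):
    # Placeholder app; in a Django project this would be the object
    # returned by django.core.wsgi.get_wsgi_application().
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok"]


if __name__ == "__main__":
    httpd = make_server("127.0.0.1", 8001, application,
                        server_class=ThreadingWSGIServer)
    httpd.serve_forever()
```

Dedicated WSGI servers (gunicorn, uWSGI) do essentially this, plus process management, which is why they are the usual recommendation.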

I could be wrong, as I did a lot of guessing in my answer. Still, I hope it gives you some ideas for troubleshooting why requests are closed (or time out).

Upvotes: 6

Rômulo Collopy

Reputation: 1044

You've probably checked this already, but I came across a similar problem yesterday, and the cause was that I was running uWSGI with `--http-socket :5000`. I changed it to `--socket :5000` and it worked perfectly. (With `--socket`, uWSGI speaks the binary uwsgi protocol, which pairs with Nginx's `uwsgi_pass` rather than a plain HTTP `proxy_pass`.)

I hope this can help someone.

Upvotes: 1
