dabadaba
dabadaba

Reputation: 9532

Where to even begin investigating issue causing database crash: remaining connection slots are reserved for non-replication superuser connections

Occasionally our Postgres database crashes and it can only be solved by restarting the server. We have tried incrementing the max connections and Django's CONN_MAX_AGE. Also, I am trying to learn how to set up PgBouncer. However, I am convinced the underlying issue must be something else which is fixable.

I am trying to find what that issue is. The problem is I wouldn't know where or what to begin to look at. Here are some pieces of information:

The errors are always OperationalError: FATAL: remaining connection slots are reserved for non-replication superuser connections and OperationalError: could not write to hash-join temporary file: No space left on device. I think this is caused by opening too many database connections, but I have never managed to catch this going down live so that I could inspect pg_stat_activity and see what actual connections were active.

Looking at the error log, the same URL shows up for the most part. I've checked the nginx log and it's listed in many different lines, meaning the request is being made multiple times at once rather than Django logging the same error multiple times. All these requests are responded with 499 Client Closed Request. In addition to this same URL, there are of course sprinkled requests of other users trying to access our site.

I should mention that the logic the server processes when the URL in question is requested is pretty simple and I see nothing suspicious that could cause a database crash. However, for some reason, the page loads slowly in production.

I know this is very vague and very little to work with, but I am not used to working sysadmin, I only studied this kind of thing in college and so far I've only worked as a developer.

Upvotes: 0

Views: 862

Answers (1)

Laurenz Albe
Laurenz Albe

Reputation: 247235

Those two problems are mostly independent.

Running out of connection slots won't crash the database. It just is a sign that you either don't use a connection pool or you have a connection leak, i.e. you forget to close transactions in your code.

Running out of space will crash your database if the condition persists.

I assume that the following happens in your system:

  • Because someone forgot a couple of join conditions or for some other reason, some of your queries take a very long time.

    They also priduce a lot of (perhaps intermediate) results that are cached in temporary files that eventually fill up the disk. This out of space condition is cleared as soon as the query fails, but it can crash the database.

  • Because these queries take long and block a database session, your application keeps starting new sessions until it reaches the limit.

Solution:

  • Find and fix thise runaway queries. As a stop-gap, you can set statement_timeout to terminate all statements that take too long.

  • Use a connection pool with an upper limit so that you don't run out of connections.

Upvotes: 1

Related Questions