Reputation: 173
TL;DR: PostgreSQL is taking a long time to shut down, and I am unable to find the root cause or reproduce the issue.
PostgreSQL major version: 13.
Full issue: For one operation, our workflow performs a couple of checkpoints before issuing SIGINT to shut down the database. Once the SIGINT (fast shutdown) is issued, I see another checkpoint happening on the PostgreSQL instance in question. Following this, the instance still fails to shut down completely for ~4 hours. By this I mean that I don't see the engine log "database system is shut down", which normally appears when shutdown has completed successfully. After the checkpoint completed successfully, all I see in the logs is the following, in a loop, for the 4 hours:
connection received: host=<> port=<>
the database system is shutting down (ProcessStartupPacket)
I believe this log comes from client apps trying to connect and being refused, since a SIGINT was issued; it is not indicative of the real reason for the stuck shutdown.
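For context, the shutdown sequence looks roughly like this (an illustrative sketch; the data directory path is a placeholder, not our actual setup):

```shell
# Manual checkpoint(s) first, so the shutdown checkpoint has little to do
psql -c "CHECKPOINT;"

# Fast shutdown: pg_ctl sends SIGINT to the postmaster
pg_ctl stop -D /path/to/data -m fast
```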
I am trying to understand what could have prevented PostgreSQL from shutting down. Since this is a critical server, I am constrained: I cannot set log_min_messages to 'DEBUG5' and attempt another shutdown to watch it meet a similar fate. On the other hand, I am not sure how to reproduce this issue in my own environment.
As a long shot, I assumed that something going on with archiving could have caused this. But even running pgbench with 10 connections for a significant amount of time, with inserts, updates, and long-running queries, I am not able to reproduce a slow shutdown.
Another aspect I was considering exploring was to accumulate a lot of WAL files, to see if archiving could indeed be the reason. But the pgbench experiment did not help much with that. Is there a way to accumulate a lot of WAL files? (I tried increasing checkpoint_timeout to the maximum possible value, but it did not help.)
To summarize, below are the questions I am looking for help with:
Upvotes: 0
Views: 1332
Reputation: 246653
SIGINT causes a “fast shutdown”, so that was correct.
You cannot. You have to use the logs. I recommend enabling log_checkpoints to see how long the final checkpoint took and what it did. You are correct to assume that if you run a checkpoint right before you shut down, the shutdown checkpoint should be fast, since it has little work left to do.
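You can turn that on without a restart; a minimal sketch (run as a superuser):

```sql
-- Takes effect on configuration reload; no restart required
ALTER SYSTEM SET log_checkpoints = on;
SELECT pg_reload_conf();
```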
Of course, you can always break into the postmaster process with a debugger and see what it is doing, but you probably don't want to do that on a production system either.
But there is nothing that keeps you from increasing log_min_messages right before you shut down. That way, you get more detail in the logs during the shutdown.
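For example (a sketch; a moderate level like 'debug2' is a suggestion, not a prescription — pick the verbosity you can afford on that server):

```sql
-- Raise log verbosity just for the shutdown attempt,
-- then reload the configuration so it takes effect immediately
ALTER SYSTEM SET log_min_messages = 'debug2';
SELECT pg_reload_conf();
```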
Not unless you can reproduce your production workload on a test system.
That is simple: write a slow archive_command that sleeps for a few seconds before archiving.
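Something along these lines in postgresql.conf (a sketch for a test system; the archive directory is a placeholder):

```ini
# Deliberately slow archiving so WAL files pile up
archive_mode = on
archive_command = 'sleep 10 && cp %p /path/to/archive/%f'
```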
You are not asking the crucial question: what could cause that behavior? I haven't read the code, but one obvious place is indicated by this part of the documentation:
The server disallows new connections and sends all existing server processes SIGTERM, which will cause them to abort their current transactions and exit promptly. It then waits for all server processes to exit and finally shuts down.
So it could be a backend process that is performing a lengthy operation without regularly calling CHECK_FOR_INTERRUPTS(). That could be a third-party extension, or it could be a PostgreSQL bug. It could also be caused by data corruption and an ensuing endless loop, but that is unlikely, since you say the shutdown finishes eventually. Watch out for any client backends that stay around for a long time, and check how long after the shutdown request you see the log message that the checkpoint is starting.
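To spot such backends before the next shutdown attempt, you could watch pg_stat_activity for long-lived client sessions (a sketch; adjust the columns to taste):

```sql
-- Long-lived client backends are the prime suspects for a stuck fast shutdown
SELECT pid, state, backend_start, xact_start,
       wait_event_type, wait_event,
       left(query, 60) AS query
FROM pg_stat_activity
WHERE backend_type = 'client backend'
ORDER BY backend_start;
```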
Upvotes: 2