Reputation: 173
TL;DR: PostgreSQL is taking a long time to shut down, and I am unable to find the root cause or reproduce the issue.
PostgreSQL major version: 13.
Full issue: For one operation, our workflow performs a couple of checkpoints before issuing SIGINT to shut down the database. Once the SIGINT (fast shutdown) is issued, I see another checkpoint happening on the PostgreSQL instance in question. Following this, the instance still fails to shut down completely for ~4 hours. By this I mean that I don't see the engine log "database system is shut down", which normally appears when shutdown has completed successfully. After the checkpoint completed successfully, all I see in the logs is the following, in a loop, for the 4 hours:
connection received: host=<> port=<>
the database system is shutting down (ProcessStartupPacket)
I believe this log comes from client apps trying to connect and being refused, since a SIGINT was issued; it is not indicative of the real reason for the stuck shutdown.
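For context, the shutdown sequence looks roughly like this (an illustrative sketch; the data directory path is a placeholder, not our actual setup):

```shell
# Manual checkpoint(s) first, so the shutdown checkpoint has little to do
psql -c "CHECKPOINT;"

# Fast shutdown: pg_ctl sends SIGINT to the postmaster
pg_ctl stop -D /path/to/data -m fast
```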
I am trying to understand what could have prevented PostgreSQL from shutting down. Since this is a critical server, I am constrained: I cannot set log_min_messages to 'DEBUG5' and attempt another shutdown to watch it meet a similar fate. On the other hand, I am not sure how to reproduce this issue in my own environment.
As a long shot, I assumed that something going on with archiving could have caused this. But even running pgbench with 10 connections for a significant amount of time, with inserts, updates, and long-running queries, I am not able to reproduce a slow shutdown.
Another aspect I was considering exploring was to accumulate a lot of WAL files, to see if archiving could indeed be the reason. But the pgbench experiment did not help much with that. Is there a way to accumulate a lot of WAL files? (I tried increasing checkpoint_timeout to the maximum possible value, but it did not help.)
To summarize, below are the questions I am looking for help with:
Upvotes: 0
Views: 1332
Reputation: 246653
SIGINT causes a “fast shutdown”, so that was correct.
You cannot. You have to use the logs. I recommend enabling log_checkpoints to see how long the final checkpoint took and what it did. You are correct to assume that if you run a checkpoint right before you shut down, the shutdown checkpoint should be fast, since it has little work left to do.
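You can turn that on without a restart; a minimal sketch (run as a superuser):

```sql
-- Takes effect on configuration reload; no restart required
ALTER SYSTEM SET log_checkpoints = on;
SELECT pg_reload_conf();
```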
Of course, you can always break into the postmaster process with a debugger and see what it is doing, but you probably don't want to do that on a production system either.
But there is nothing that keeps you from increasing log_min_messages right before you shut down. That way, you get more detail in the logs during the shutdown.
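For example (a sketch; a moderate level like 'debug2' is a suggestion, not a prescription — pick the verbosity you can afford on that server):

```sql
-- Raise log verbosity just for the shutdown attempt,
-- then reload the configuration so it takes effect immediately
ALTER SYSTEM SET log_min_messages = 'debug2';
SELECT pg_reload_conf();
```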
Not unless you can reproduce your production workload on a test system.
That is simple: write a slow archive_command that sleeps for a few seconds before archiving.
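Something along these lines in postgresql.conf (a sketch for a test system; the archive directory is a placeholder):

```ini
# Deliberately slow archiving so WAL files pile up
archive_mode = on
archive_command = 'sleep 10 && cp %p /path/to/archive/%f'
```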
You are not asking the crucial question: what could cause that behavior? I haven't read the code, but one obvious place is indicated by this part of the documentation:
The server disallows new connections and sends all existing server processes SIGTERM, which will cause them to abort their current transactions and exit promptly. It then waits for all server processes to exit and finally shuts down.
So it could be a backend process that is performing a lengthy operation without regularly calling CHECK_FOR_INTERRUPTS(). That could be a third-party extension, or it could be a PostgreSQL bug. It could also be caused by data corruption and an ensuing endless loop, but that is unlikely, since you say the shutdown finishes eventually. Watch out for any client backends that stay around for a long time, and check how long after the shutdown request you see the log message that the checkpoint is starting.
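To spot such backends before the next shutdown attempt, you could watch pg_stat_activity for long-lived client sessions (a sketch; adjust the columns to taste):

```sql
-- Long-lived client backends are the prime suspects for a stuck fast shutdown
SELECT pid, state, backend_start, xact_start,
       wait_event_type, wait_event,
       left(query, 60) AS query
FROM pg_stat_activity
WHERE backend_type = 'client backend'
ORDER BY backend_start;
```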
Upvotes: 2