tmowbray

Reputation: 13

postgresql replication | application freezes when slave/recovery server goes down

I support an application hosted by a small business: a web-based RoR app using a PostgreSQL database on the backend.

Postgres is set up for replication to an off-site standby server, which as far as I can tell is working fine; when I query the remote server it shows that it's in recovery, etc.

From the 'master' server:

postgres=# table pg_stat_replication ;
-[ RECORD 1 ]----+------------------------------
pid              | 18660
usesysid         | 1281085
usename          | rep
application_name | postgresql2
client_addr      | 192.168.81.155
client_hostname  |
client_port      | 43824
backend_start    | 2017-05-07 11:42:43.150057-04
state            | streaming
sent_location    | 3/B5243418
write_location   | 3/B5243418
flush_location   | 3/B5243418
replay_location  | 3/B5243150
sync_priority    | 1
sync_state       | sync

...and on the 'slave':

postgres=# select pg_is_in_recovery();
 pg_is_in_recovery
-------------------
 t
(1 row)

postgres=# select now() - pg_last_xact_replay_timestamp() AS replication_delay;
 replication_delay
-------------------
 01:02:14.885511
(1 row)

I understand the process involved should I have to promote my remote slave DB to the role of master. The problem is that on 2 or 3 occasions now the network link to the remote slave server has gone down, and the application completely "freezes up" (pages load, but users cannot log on), despite the fact that the master DB is still up and running.

I have WAL archiving enabled to make sure that when something like this happens the data is preserved until the link is restored and the transaction logs can be sent. But I don't understand why my master pgsql instance seems to lock up just because the slave instance goes offline. That kind of defeats the entire concept of replication, so I assume I must be doing something wrong?
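For context, the kind of archiving setup I mean looks roughly like this (the archive directory is just a placeholder, not my actual path):

    # postgresql.conf on the master -- WAL archiving so segments are
    # kept until the standby can fetch them again
    # (/mnt/wal_archive is a placeholder path)
    archive_mode = on
    archive_command = 'test ! -f /mnt/wal_archive/%f && cp %p /mnt/wal_archive/%f'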

Upvotes: 1

Views: 1305

Answers (1)

Laurenz Albe

Reputation: 247625

The most likely explanation is that you are using synchronous replication with just two nodes; indeed, your pg_stat_replication output above shows sync_state = sync for the standby.

Is synchronous_standby_names set on the master server?

If the only synchronous standby server is not available, no transaction can commit on the master, and data-modifying transactions will “hang”, which would explain the behaviour you observe.
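As a sketch of how to verify and, if you prefer availability over the synchronous guarantee, fall back to asynchronous replication (assuming PostgreSQL 9.4 or later for ALTER SYSTEM; on older versions, edit postgresql.conf and reload instead):

    -- on the master: check whether a synchronous standby is required
    SHOW synchronous_standby_names;

    -- fall back to asynchronous replication so commits no longer
    -- wait for the standby (ALTER SYSTEM requires 9.4+)
    ALTER SYSTEM SET synchronous_standby_names = '';
    SELECT pg_reload_conf();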

For synchronous replication you need at least two standbys.
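For example (the standby names below are hypothetical), listing two standbys means the first connected one acts as the synchronous standby, and if it goes down, the next in the list takes over, so commits can keep flowing:

    # postgresql.conf on the master (standby names are hypothetical)
    synchronous_standby_names = 'standby1, standby2'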

Upvotes: 2
