Reputation: 619
Can someone suggest the steps to check pgsql replication status and how to identify if the replication is not happening properly?
We use streaming replication with pgsql9.0 and pgsql9.4
Upvotes: 50
Views: 230881
Reputation: 32378
If you have small database (~10-100 GB) and reasonably fast network, there's a good chance the database is already streaming files:
On standby:
psql -c "SELECT extract(epoch from now() - pg_last_xact_replay_timestamp());"
On primary:
psql -c "select state, client_hostname, write_lag, replay_lag, flush_lag from pg_stat_replication;"
For larger databases (>> 1TB) each replication phase might take significant time, thus it might be important realize in which stage of replication you are at.
When performing full DB replication there are multiple stages:
pg_basebackup
cloning database from primary to a standby server (standby is not running)A) Full backup
Since PostgreSQL 13 you can use pg_stat_progress_basebackup
table, where you can get progress for each replication phase (from primary):
psql -c "select TO_CHAR(backup_streamed::decimal/backup_total*100,'fm00D00%'),phase from pg_stat_progress_basebackup;" -t
31.79% | streaming database files
The replication process has multiple phases, progress is reported per each phase:
initializing
starting file transfer
streaming database files
finishing file transfer
transferring WAL
On earlier PostgreSQL version you're left with plain disk usage:
$ primary=$(du -s /var/lib/postgresql/14/main/base | awk '{print $1}')
$ replica=4648394012 # same command as above only from replica
$ echo ${replica}/${primary}*100 | bc -l
34.90793993287965597200
pg_stat_replication
gives you e.g. replay_lag
which doesn't help much with overall progress. You can only estimate how many WAL you might actually need to keep.
psql -c "select state, client_hostname, write_lag, replay_lag, flush_lag from pg_stat_replication;"
state | client_hostname | write_lag | replay_lag | flush_lag
-----------+-----------------+-----------------+-----------------+-----------------
backup | | | |
streaming | | 00:00:00.000516 | 12:23:15.991732 | 00:00:00.310261
(2 rows)
B) recovery
On primary server, use pg_stat_replication
table:
psql -c "select state, client_hostname, write_lag, replay_lag, flush_lag from pg_stat_replication;"
state | client_hostname | write_lag | replay_lag | flush_lag
---------+-----------------+-----------------+-----------------+-----------------
catchup | | 00:04:50.006556 | 00:04:50.006556 | 00:04:50.006556
Standby server has finished the first stage, assuming the server was able to start (correct configuration of critical parameters). Now the postgres
process has to replay accumulated WALs on the standby or copy them from archive (e.g. using archive_command
) in order to reach consistent recovery state. You can obtain the LSN using pg_controldata
:
$ /usr/lib/postgresql/14/bin/pg_controldata /var/lib/postgresql/14/main | grep "Minimum recovery"
Minimum recovery ending location: 4034A/1BFFFB58
NOTE: There's no record in pg_stat_replication
table on the primary, at this stage.
You'll notice in logs something like:
FATAL: the database system is not yet accepting connections
DETAIL: Consistent recovery state has not been yet reached.
LOG: restored log file "000000020004022E00000029" from archive
The HEX number points to a WAL file, that can be dumped (typically single line is enough)
/usr/lib/postgresql/14/bin/pg_waldump 14/main/pg_wal/00000002000402340000003D -n 1
rmgr: Heap len (rec/tot): 48/ 48, tx: 1350166609, lsn: 40234/F4001C08, prev 40234/F3FFFDB0, desc: HEAP_CONFIRM off 52, blkref #0: rel 1663/16421/2324630632 blk 85514
we're looking for the LSN part, e.g. lsn: 40234/F4001C08
. Now, when we get back to the primary server, we can obtain the lag in file size units:
psql -c "SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), '40234/F4001C08'));" -t
1813 GB
1813 GB
is the diff between current WAL and the one being replayed.
Another option is to use pg_controldata
$ /usr/lib/postgresql/14/bin/pg_controldata /var/lib/postgresql/14/main | grep location
Latest checkpoint's REDO location: 40335/C001D040
where you can obtain following LSN pointers:
Backup start location
Latest checkpoint's REDO location
- current recovery positionMinimum recovery ending location
C) WAL streaming
Primary and standby are in almost the same state, both servers are accepting connections. Except there might be some lag between those two servers, which might be caused by frequency of writes to WAL, network transfer speed, etc.
On standby:
psql -c "SELECT extract(epoch from now() - pg_last_xact_replay_timestamp());"
On primary:
psql -c "select state, client_hostname, write_lag, replay_lag, flush_lag from pg_stat_replication;"
This will give you time difference between two servers.
Upvotes: -1
Reputation: 2581
On your master, pg_stat_replication provides data about ongoing replication:
select client_addr, state, sent_location, write_location,
flush_location, replay_location from pg_stat_replication;
On postgresql v10:
select client_addr, state, sent_lsn, write_lsn,
flush_lsn, replay_lsn from pg_stat_replication;
Upvotes: 31
Reputation: 2164
I use following SQL queries to check status on Postgres v11 usually.
On master:
select * from pg_stat_replication;
On replica (streaming replication in my case):
select * from pg_stat_wal_receiver;
Upvotes: 84
Reputation: 419
Show replication status in PostgreSQL
postgres=# select usename,application_name,client_addr,backend_start,state,sync_state from pg_stat_replication ;
usename | application_name | client_addr | backend_start | state | sync_state
------------+------------------+----------------+-------------------------------+-----------+------------
replicator | walreceiver | 192.168.10.132 | 2018-07-06 06:12:20.786918+03 | streaming | async
(1 row)
postgres=# select pg_is_in_recovery();
pg_is_in_recovery
-------------------
t
(1 row)
postgres=# select pg_last_xlog_receive_location();
pg_last_xlog_receive_location
-------------------------------
0/540C1DB8
postgres=# select pg_last_xlog_replay_location();
pg_last_xlog_replay_location
------------------------------
0/540C1DB8
(1 row)
postgres=# SELECT CASE WHEN pg_last_xlog_receive_location() = pg_last_xlog_replay_location()
THEN 0
ELSE EXTRACT (EPOCH FROM now() - pg_last_xact_replay_timestamp())
END AS log_delay;
log_delay
-----------
0
(1 row)
Upvotes: 30