Dino Daniel
Dino Daniel

Reputation: 619

Check postgres replication status

Can someone suggest the steps to check pgsql replication status and how to identify if the replication is not happening properly?

We use streaming replication with pgsql9.0 and pgsql9.4

Upvotes: 50

Views: 230881

Answers (4)

Tombart
Tombart

Reputation: 32378

Short answer

If you have small database (~10-100 GB) and reasonably fast network, there's a good chance the database is already streaming files:

On standby:

psql -c "SELECT extract(epoch from now() - pg_last_xact_replay_timestamp());"

On primary:

psql -c "select state, client_hostname, write_lag, replay_lag, flush_lag from pg_stat_replication;"

Long answer

For larger databases (>> 1TB) each replication phase might take significant time, thus it might be important realize in which stage of replication you are at.

When performing full DB replication there are multiple stages:

  • A) full backup - pg_basebackup cloning database from primary to a standby server (standby is not running)
  • B) recovery - standby server starting up (not accepting connections yet)
  • C) WAL streaming - Both primary and standby(s) in consistent state and accepting connections

A) Full backup

Since PostgreSQL 13 you can use pg_stat_progress_basebackup table, where you can get progress for each replication phase (from primary):

psql -c "select TO_CHAR(backup_streamed::decimal/backup_total*100,'fm00D00%'),phase from pg_stat_progress_basebackup;" -t
 31.79%  | streaming database files

The replication process has multiple phases, progress is reported per each phase:

  1. initializing
  2. starting file transfer
  3. streaming database files
  4. finishing file transfer
  5. transferring WAL

On earlier PostgreSQL version you're left with plain disk usage:

$ primary=$(du -s /var/lib/postgresql/14/main/base | awk '{print $1}')
$ replica=4648394012 # same command as above only from replica
$ echo ${replica}/${primary}*100 | bc -l
34.90793993287965597200

pg_stat_replication gives you e.g. replay_lag which doesn't help much with overall progress. You can only estimate how many WAL you might actually need to keep.

psql -c "select state, client_hostname, write_lag, replay_lag, flush_lag from pg_stat_replication;"
   state   | client_hostname |    write_lag    |   replay_lag    |    flush_lag    
-----------+-----------------+-----------------+-----------------+-----------------
 backup    |                 |                 |                 | 
 streaming |                 | 00:00:00.000516 | 12:23:15.991732 | 00:00:00.310261
(2 rows)

B) recovery

On primary server, use pg_stat_replication table:

psql -c "select state, client_hostname, write_lag, replay_lag, flush_lag from pg_stat_replication;"

  state  | client_hostname |    write_lag    |   replay_lag    |    flush_lag    
---------+-----------------+-----------------+-----------------+-----------------
 catchup |                 | 00:04:50.006556 | 00:04:50.006556 | 00:04:50.006556

Standby server has finished the first stage, assuming the server was able to start (correct configuration of critical parameters). Now the postgres process has to replay accumulated WALs on the standby or copy them from archive (e.g. using archive_command) in order to reach consistent recovery state. You can obtain the LSN using pg_controldata:

$ /usr/lib/postgresql/14/bin/pg_controldata /var/lib/postgresql/14/main | grep "Minimum recovery"
Minimum recovery ending location:     4034A/1BFFFB58

NOTE: There's no record in pg_stat_replication table on the primary, at this stage.

You'll notice in logs something like:

FATAL:  the database system is not yet accepting connections
DETAIL:  Consistent recovery state has not been yet reached.
LOG:  restored log file "000000020004022E00000029" from archive

The HEX number points to a WAL file, that can be dumped (typically single line is enough)

/usr/lib/postgresql/14/bin/pg_waldump 14/main/pg_wal/00000002000402340000003D -n 1
rmgr: Heap        len (rec/tot):     48/    48, tx: 1350166609, lsn: 40234/F4001C08, prev 40234/F3FFFDB0, desc: HEAP_CONFIRM off 52, blkref #0: rel 1663/16421/2324630632 blk 85514

we're looking for the LSN part, e.g. lsn: 40234/F4001C08. Now, when we get back to the primary server, we can obtain the lag in file size units:

 psql -c "SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), '40234/F4001C08'));" -t
 1813 GB

1813 GB is the diff between current WAL and the one being replayed.

Another option is to use pg_controldata

$ /usr/lib/postgresql/14/bin/pg_controldata /var/lib/postgresql/14/main | grep location
Latest checkpoint's REDO location:    40335/C001D040

where you can obtain following LSN pointers:

  • Backup start location
  • Latest checkpoint's REDO location - current recovery position
  • Minimum recovery ending location

C) WAL streaming

Primary and standby are in almost the same state, both servers are accepting connections. Except there might be some lag between those two servers, which might be caused by frequency of writes to WAL, network transfer speed, etc.

On standby:

psql -c "SELECT extract(epoch from now() - pg_last_xact_replay_timestamp());"

On primary:

psql -c "select state, client_hostname, write_lag, replay_lag, flush_lag from pg_stat_replication;"

This will give you time difference between two servers.

Upvotes: -1

Reinsbrain
Reinsbrain

Reputation: 2581

On your master, pg_stat_replication provides data about ongoing replication:

select client_addr, state, sent_location, write_location,
        flush_location, replay_location from pg_stat_replication;

On postgresql v10:

select client_addr, state, sent_lsn, write_lsn,
    flush_lsn, replay_lsn from pg_stat_replication;

Upvotes: 31

Alexey
Alexey

Reputation: 2164

I use following SQL queries to check status on Postgres v11 usually.

On master:

select * from pg_stat_replication;

On replica (streaming replication in my case):

select * from pg_stat_wal_receiver;

Upvotes: 84

truth
truth

Reputation: 419

Show replication status in PostgreSQL

on server

postgres=# select usename,application_name,client_addr,backend_start,state,sync_state from pg_stat_replication ;

usename   | application_name |  client_addr   |         backend_start         |   state   | sync_state 
------------+------------------+----------------+-------------------------------+-----------+------------
replicator | walreceiver      | 192.168.10.132 | 2018-07-06 06:12:20.786918+03 | streaming | async
(1 row)

on client

postgres=# select pg_is_in_recovery();
 pg_is_in_recovery 
-------------------
 t
 (1 row)


postgres=# select pg_last_xlog_receive_location();
 pg_last_xlog_receive_location 
-------------------------------
 0/540C1DB8


postgres=# select pg_last_xlog_replay_location();
 pg_last_xlog_replay_location 
------------------------------
 0/540C1DB8
 (1 row)

postgres=#    SELECT CASE WHEN pg_last_xlog_receive_location() = pg_last_xlog_replay_location()
                  THEN 0
                ELSE EXTRACT (EPOCH FROM now() - pg_last_xact_replay_timestamp())
              END AS log_delay;
 log_delay 
-----------
 0
 (1 row)

Upvotes: 30

Related Questions