roegi

Reputation: 183

distributed postgres deployment out of sync

I set up a distributed PostgreSQL database system and configured replication with wal_level = hot_standby.

There is one central master database with multiple read-only replicas (currently 15, spread across the world) - so no failover is needed; we just want the data synced to remote locations where we can read it.

master:

wal_level = hot_standby
max_wal_senders = 20
checkpoint_segments = 8    
wal_keep_segments = 8

clients:

wal_level = hot_standby
max_wal_senders = 3
checkpoint_segments = 8    
wal_keep_segments = 8 
hot_standby = on

/var/lib/postgresql/9.4/recovery.conf on the client side:

standby_mode = 'on'
primary_conninfo = 'host=<IP of master> port=5432 user=replicator password=xxxx sslmode=require'
trigger_file = '/tmp/postgresql.trigger'

The replication starts and all seems fine. After some days, however, the master no longer accepts replication connections...

client:

2017-05-04 01:16:51 UTC [9608-1] FATAL:  could not connect to the primary server: FATAL:  sorry, too many clients already
2017-05-04 01:16:57 UTC [10807-1] FATAL:  could not connect to the primary server: FATAL:  sorry, too many clients already
2017-05-04 01:17:02 UTC [12022-1] FATAL:  could not connect to the primary server: FATAL:  sorry, too many clients already
2017-05-04 01:17:06 UTC [13217-1] FATAL:  could not connect to the primary server: FATAL:  remaining connection slots are reserved for non-replication superuser connections
...

master:

and then the log is full of messages like the ones below - and it never recovers:

2017-05-04 08:44:14 UTC [24850-1] replicator@[unknown] ERROR:  requested WAL segment 000000010000003500000014 has already been removed
2017-05-04 08:44:19 UTC [25958-1] replicator@[unknown] ERROR:  requested WAL segment 000000010000003500000014 has already been removed
2017-05-04 08:44:24 UTC [27063-1] replicator@[unknown] ERROR:  requested WAL segment 000000010000003500000014 has already been removed
2017-05-04 08:44:29 UTC [28144-1] replicator@[unknown] ERROR:  requested WAL segment 000000010000003500000014 has already been removed
2017-05-04 08:44:34 UTC [29227-1] replicator@[unknown] ERROR:  requested WAL segment 000000010000003500000014 has already been removed
2017-05-04 08:44:39 UTC [30316-1] replicator@[unknown] ERROR:  requested WAL segment 000000010000003500000014 has already been removed
...

client:

2017-04-30 11:26:22 UTC [28474-1] LOG:  started streaming WAL from primary at 35/14000000 on timeline 1
2017-04-30 11:26:22 UTC [28474-2] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 000000010000003500000014 has already been removed
2017-04-30 11:26:26 UTC [29328-1] LOG:  started streaming WAL from primary at 35/14000000 on timeline 1
2017-04-30 11:26:26 UTC [29328-2] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 000000010000003500000014 has already been removed
2017-04-30 11:26:31 UTC [30394-1] LOG:  started streaming WAL from primary at 35/14000000 on timeline 1
2017-04-30 11:26:31 UTC [30394-2] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 000000010000003500000014 has already been removed
...

So my question: how can I get the replicas back in sync once the master has already removed the WAL segments they need?

Upvotes: 0

Views: 1034

Answers (1)

Vao Tsun

Reputation: 51406

https://www.postgresql.org/docs/current/static/runtime-config-replication.html:

max_wal_senders (integer)

Specifies the maximum number of concurrent connections from standby servers or streaming base backup clients (i.e., the maximum number of simultaneously running WAL sender processes). The default is zero, meaning replication is disabled. WAL sender processes count towards the total number of connections, so the parameter cannot be set higher than max_connections. Abrupt streaming client disconnection might cause an orphaned connection slot until a timeout is reached, so this parameter should be set slightly higher than the maximum number of expected clients so disconnected clients can immediately reconnect.

(emphasis mine). Either application connections or orphaned connections led to your

FATAL: sorry, too many clients already
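A quick way to see whether the slots are being eaten by application backends or by orphaned WAL senders is to query the master directly (a diagnostic sketch using the standard statistics views in PostgreSQL 9.4):

```sql
-- How many backend connections exist in total, versus the configured limit
SELECT count(*) AS total_backends FROM pg_stat_activity;
SHOW max_connections;

-- Active WAL sender processes: one row per connected replica.
-- With 15 replicas and max_wal_senders = 20 there is little headroom
-- for stale senders that have not timed out yet.
SELECT pid, usename, client_addr, state
FROM pg_stat_replication;
```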

You might want to put a connection pooler such as pgbouncer in front of the database for your applications, capping the number of connections before they exhaust max_connections.
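A minimal pgbouncer configuration might look like this (host, database name, and pool sizes here are placeholders, not values from the question):

```ini
; pgbouncer.ini - applications connect to pgbouncer on port 6432
; instead of directly to PostgreSQL on 5432
[databases]
mydb = host=127.0.0.1 port=5432 dbname=mydb

[pgbouncer]
listen_addr = *
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
max_client_conn = 500    ; client connections pgbouncer will accept
default_pool_size = 20   ; real server connections per database/user pair
```

This way application load can spike without consuming the connection slots that the replication users need.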

Answering your question: if you have archive_command set so that WAL segments are actually copied somewhere, set restore_command in recovery.conf on the standby to pick them up. That will allow the standby to catch up from the moment it lost the stream. Otherwise you have to rebuild it (e.g. with a fresh pg_basebackup).
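As a sketch (the archive paths and host below are placeholders you would adapt to your setup): the master archives each finished WAL segment, and the standby's recovery.conf falls back to that archive when a segment is no longer available via streaming:

```
# postgresql.conf on the master - keep a copy of every finished WAL segment
archive_mode = on
archive_command = 'test ! -f /var/lib/postgresql/wal_archive/%f && cp %p /var/lib/postgresql/wal_archive/%f'

# recovery.conf on the standby - fetch missing segments from the archive
# (assumes the archive directory is reachable from the standby)
restore_command = 'cp /var/lib/postgresql/wal_archive/%f %p'
```

With this in place, a standby that falls behind the wal_keep_segments window can still recover from the archive instead of failing with "requested WAL segment ... has already been removed".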

Upvotes: 1
