devilsansclue

Reputation: 199

bdr_init_copy hangs indefinitely

Fairly new to PostgreSQL, but I have to get replication set up. I settled on BDR, and it works fine in the local demo, but on distributed machines it starts to get problematic, mostly because I have no real clue what the hell I am doing, and I cry myself to sleep pining for MySQL. I've almost gotten BDR working across multiple servers. When I run:

SELECT bdr.bdr_node_join_wait_for_ready();

on the joining nodes, it hangs. This happens on both DB2 and DB3. DB1 returns a valid response. Researching this, I came across the bdr_init_copy command, which apparently does everything I have been doing by hand, and then some, so I tried that out. Now, when I run:

/usr/lib/postgresql/9.4/bin/bdr_init_copy -d "host=192.168.1.10 dbname=demo3" --local-dbname="host=192.168.1.23 dbname=demo3" -n db2 -D bdr_data

I get

bdr_init_copy: starting ...
Getting remote server identification ...
Detected 2 BDR database(s) on remote server
Updating BDR configuration on the remote node:
 demo2: creating replication slot ...
 demo2: creating node entry for local node ...
 demo3: creating replication slot ...
 demo3: creating node entry for local node ...
Creating base backup of the remote node...
63655/63655 kB (100%), 1/1 tablespace
Creating restore point on remote node ...
Bringing local node to the restore point ...

And it sits there. I am assuming that both issues have the same cause. As far as I can tell, no log entries are created on the local node (db2), but the following is present on the remote node (db1):

2016-10-12 22:38:43 UTC [20808-1] postgres@demo2 LOG:  logical decoding found consistent point at 0/5001F00
2016-10-12 22:38:43 UTC [20808-2] postgres@demo2 DETAIL:  There are no running transactions.
2016-10-12 22:38:43 UTC [20808-3] postgres@demo2 STATEMENT:  SELECT pg_create_logical_replication_slot('bdr_17163_6340711416785871202_2_17163__', 'bdr');
2016-10-12 22:38:43 UTC [20811-1] postgres@demo3 LOG:  logical decoding found consistent point at 0/5002090
2016-10-12 22:38:43 UTC [20811-2] postgres@demo3 DETAIL:  There are no running transactions.
2016-10-12 22:38:43 UTC [20811-3] postgres@demo3 STATEMENT:  SELECT pg_create_logical_replication_slot('bdr_17939_6340711416785871202_2_17939__', 'bdr');
2016-10-12 22:38:44 UTC [20812-1] postgres@demo3 LOG:  restore point "bdr_6340711416785871202" created at 0/50022A8
2016-10-12 22:38:44 UTC [20812-2] postgres@demo3 STATEMENT:  SELECT pg_create_restore_point('bdr_6340711416785871202')

Any help out there?

Upvotes: 1

Views: 758

Answers (1)

JRC

Reputation: 61

Right, I just had this issue and none of the other forums were any help. Some of them even say that it is okay for the new node to report its status as "o" while the other nodes report the new server's status as "i", because "this is just a bug and it's fine". It's NOT OKAY. The new server could receive replication updates, but no primary updates were possible on the new server.

The key to solving this problem is to crank up the logging on the server you are joining to (not the new one). In the new server's logs, you might see things like:

08006: could not receive data from client: Connection reset by peer

which is not very helpful and will have you checking firewalls, etc. The real clue comes from the source server's logs, which will contain something like:

no free replication state could be found for 11, increase max_replication_slots

What has probably happened is that you either have too many servers in your cluster for the default settings or, more likely, there is some junk left over from old hosts.
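As a rough way to tell those two cases apart, something like the following could be run on the server you are joining to. This is only a sketch, not part of the original fix: it assumes the BDR-patched 9.4 catalogs used elsewhere in this answer, and the column aliases (slots_in_use, identifiers_in_use) are made up for illustration.

```sql
-- Configured ceiling; the error message above names this setting.
SHOW max_replication_slots;

-- Replication slots currently defined on this server.
SELECT count(*) AS slots_in_use FROM pg_replication_slots;

-- Replication identifiers currently defined (BDR-patched 9.4 catalog).
SELECT count(*) AS identifiers_in_use FROM pg_replication_identifier;

-- Join state of every node as seen from this server.
SELECT node_name, node_status FROM bdr.bdr_nodes ORDER BY node_name;
```

If either count is at or near the configured value and there are more entries than live nodes, leftovers are the likely culprit. If every entry belongs to a live node, raising max_replication_slots in postgresql.conf (which needs a server restart) may be all that is required.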

You need to clean things up ... ON EVERY SERVER IN THE EXISTING CLUSTER (NB!). Start by getting a list of things on the existing cluster:

select * from bdr.bdr_nodes order by node_sysid;

THEN, check the following:

select conn_sysid,conn_dboid from bdr.bdr_connections order by conn_sysid;

If you see old entries (ones whose conn_sysid does not match any node_sysid from the first query), delete them, e.g.:

delete from bdr.bdr_connections where conn_sysid='<stale-conn_sysid>';

select * from pg_replication_slots order by slot_name;

If you see old entries that don't contain an active sysid, drop them. NB: use the function, DO NOT do a "delete from". E.g.:

select pg_drop_replication_slot('bdr_17213_6574566740899221664_1_17213__');

select * from pg_replication_identifier order by riname;

If you see old entries that don't contain an active sysid, drop them as well. NB: again, use the function, DO NOT do a "delete from". E.g.:

select pg_replication_identifier_drop('bdr_6443767151306784833_1_17210_17213_');
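The manual comparison in the steps above could also be sketched as a set of anti-join queries that list only the candidates for removal. Treat this as an assumption-laden sketch: it assumes the sysid is the third underscore-separated field of a BDR slot name and the second field of a replication identifier name, which matches the examples in this thread but is not confirmed here as a general rule. The aliases remote_sysid and origin_sysid are made up for illustration.

```sql
-- Connections pointing at a sysid that bdr.bdr_nodes does not know about.
SELECT c.conn_sysid, c.conn_dboid
FROM bdr.bdr_connections c
WHERE NOT EXISTS (
    SELECT 1 FROM bdr.bdr_nodes n WHERE n.node_sysid = c.conn_sysid
);

-- BDR slots whose embedded sysid (assumed: 3rd field) is not a known node.
SELECT s.slot_name, s.active,
       split_part(s.slot_name, '_', 3) AS remote_sysid
FROM pg_replication_slots s
WHERE s.plugin = 'bdr'
  AND NOT EXISTS (
      SELECT 1 FROM bdr.bdr_nodes n
      WHERE n.node_sysid = split_part(s.slot_name, '_', 3)
  );

-- Replication identifiers whose embedded sysid (assumed: 2nd field)
-- is not a known node.
SELECT r.riname,
       split_part(r.riname, '_', 2) AS origin_sysid
FROM pg_replication_identifier r
WHERE r.riname LIKE 'bdr\_%'
  AND NOT EXISTS (
      SELECT 1 FROM bdr.bdr_nodes n
      WHERE n.node_sysid = split_part(r.riname, '_', 2)
  );
```

These queries only list candidates; nothing is deleted. Check each row by eye before dropping it with the functions shown above. Nodes that were removed from the cluster may still have a row in bdr.bdr_nodes, so their leftovers would not show up here and still need the manual check.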

With any luck, after you've done this on every node, you will see your new server's BDR status go to 'r'. As you clean up each host, you should notice that the log messages "08006: could not receive data from client: Connection reset by peer" matching the conn_sysid of the server you've just cleaned up stop appearing. Good luck.

Upvotes: 1
