yravi104
yravi104

Reputation: 23

Automatic recovery of the failed postgresql master node is not working with pgpool II

I am new to Postgresql and Pgpool II setup. I have configured the Postgresql HA/Load balancing using Pgpool and Repmgr.

The setup consist of 3 nodes and verison of Application and OS is as mentioned below:  
**Pgpool node** => 192.168.0.4, **Postgresql Nodes** => 192.168.0.6, 192.168.0.7   
**OS version** => CentOS 6.8 (On all the 3 nodes)  
**Pgpool II version** =>  pgpool-II version 3.5.0 (ekieboshi).  
**Postgresql Version** => PostgreSQL 9.4.8  
**Repmgr Version** => repmgr 3.1.3 (PostgreSQL 9.4.8)  

I have followed the link to do the setup.

When I bring down the master node, the failover happens successfully and the Slave node takes over as the new Master node.
After failover, I have to recover the failed node manually and sync it with the new Master node.

I want to automate the recovery process.

The pgpool.conf file on the pgpool node contains parameter recovery_1st_stage_command. I have searched the sources online and found that the paramater "recovery_1st_stage_command" should be set in the configuration file pgpool.conf on the Pgpool node.

I have set the parameter recovery_1st_stage_command = 'basebackup.sh'. I have placed the script 'basebackup.sh' file on both the Postgresql node under the data directory '/var/lib/pgsql/9.4/data'.

Also I have placed the script 'pgpool_remote_start' on both the database node under the directory '/var/lib/pgsql/9.4/data'.

Also created the pgpool extension pgpool_recovery and pgpool_adm on both the database node.

When the Master node is stopped, the failover happens but the recovery script 'basebackup.sh' is not executed.
I have checked the pgpool logs and enabled debug level as well. Still cannot find whether the script got executed or not.

Please help me with the automatic online recovery of the failed node. Find the scripts used by me.

basebackup.sh

#!/bin/bash
# first stage recovery
# $1 datadir
# $2 desthost
# $3 destdir

#as I'm using repmgr it's not necessary for me to know datadir(master) $1
RECOVERY_NODE=$2
CLUSTER_PATH=$3
#repmgr needs to know the master's ip
MASTERNODE=`/sbin/ifconfig eth0 | grep inet | awk '{print $2}' | sed 's/addr://'`

cmd1=`ssh postgres@$RECOVERY_NODE "repmgr -D $CLUSTER_PATH --force standby clone $MASTERNODE"`
echo $cmd1

pgpool_remote_start script.

#! /bin/sh

if [ $# -ne 2 ]
then
    echo "pgpool_remote_start remote_host remote_datadir"
    exit 1
fi

DEST=$1
DESTDIR=$2
PGCTL=/usr/pgsql-9.4/bin/pg_ctl

ssh -T $DEST $PGCTL -w -D $DESTDIR start 2>/dev/null 1>/dev/null < /dev/null &

Thanks.

Upvotes: 1

Views: 3669

Answers (1)

Saule
Saule

Reputation: 589

I think this is as designed. When a master fails, there is a failover and so the slave gets promoted. But the old master is not automatically recovered as a slave. At the contrary, usually the failover script will try to shutdown the failed master for good and disable it from restarting (if possible, maybe the node is down and it is not possible to connect to), this to avoid a split-brain.

If you really want that then you could modify the failover script in such a way that it will do the pcp_recovery operation on the old master after the slave is promoted. But then what you are in fact doing is a switchover... this should be scripted as a series of step. A failover is for when there is a real issue with the master (like machine not responding)

Upvotes: 3

Related Questions