sentinel cannot promote initial master back to MASTER mode after failover

Question

I am running Redis in Sentinel mode with 1 master node, 2 replica nodes, and 3 Sentinel nodes, all in a Docker Swarm environment. All nodes start fine. At startup the nodes have the following IPs:

master      10.0.20.2 
replica-1   10.0.20.5
replica-2   10.0.20.10

Next, I stop the master container to bring the master node down so that Sentinel picks one of the replica nodes as the new master. This works, and replica-1 is selected as the new master.

In the meantime, Docker Swarm spins up a new container for master, and it joins the Redis Sentinel network as a slave.

Next, I bring the replica-1 node down to trigger another failover. The actual issue occurs when Sentinel tries to promote the master node from slave back to master.

Below is the master node's Redis config file at the point where Sentinel tries to make it master. I am wondering why the file is updated with replicaof 10.0.20.2 6379 when this node is the new master and that IP belongs to this same node.

master node redis.conf

root@0fd67f6ceb37:/data# tail -f /etc/redis/redis.conf
replica-announce-ip "redis-master"
#replica-announce-port 6379
save 900 1
save 300 10
save 60 10000
stop-writes-on-bgsave-error no
rdbchecksum yes
# Generated by CONFIG REWRITE

replicaof 10.0.20.2 6379
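
One way to spot this condition mechanically is to compare the `replicaof` target in the rewritten file with the node's own address. The following is a minimal sketch, not part of the original setup: it assumes the config text has already been read into a string and that 10.0.20.2 is this node's address, as in the IP list above. The function name `replicaof_target` is hypothetical.

```python
from typing import Optional, Tuple


def replicaof_target(conf_text: str) -> Optional[Tuple[str, int]]:
    """Return (host, port) of the last active replicaof/slaveof line found, or None."""
    target = None
    for raw in conf_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comments such as "# Generated by CONFIG REWRITE".
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if parts[0].lower() in ("replicaof", "slaveof") and len(parts) == 3:
            target = (parts[1].strip('"'), int(parts[2]))
    return target


# Excerpt of the rewritten redis.conf shown above.
conf = '''replica-announce-ip "redis-master"
#replica-announce-port 6379
# Generated by CONFIG REWRITE

replicaof 10.0.20.2 6379
'''

own_ip = "10.0.20.2"  # assumption: this node's address, taken from the IP list
target = replicaof_target(conf)
points_at_itself = target is not None and target[0] == own_ip
print(target, points_at_itself)
```

For the excerpt above, `points_at_itself` is true, which matches the symptom described: the node has been told to replicate from its own address.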

This configuration is wrong, so the promotion fails after some time and Sentinel picks replica-2 as the new master. This is the error I see in the master node logs (the detailed log file is below):

Master is currently unable to PSYNC but should be in the future: -NOMASTERLINK Can't SYNC while not connected with my master

In the end, replica-2 acts as master, with replica-1 and master as the two slaves.

master node logs (this is after master joins as slave and sentinel tries to promote it to master mode)

[docker@chopswarm1 redis-failover]$ d logs 0fd67f6ceb37
1:C 05 Nov 2019 06:43:49.360 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:C 05 Nov 2019 06:43:49.360 # Redis version=5.0.5, bits=64, commit=00000000, modified=0, pid=1, just started
1:C 05 Nov 2019 06:43:49.360 # Configuration loaded
1:M 05 Nov 2019 06:43:49.361 * Running mode=standalone, port=6379.
1:M 05 Nov 2019 06:43:49.361 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
1:M 05 Nov 2019 06:43:49.361 # Server initialized
1:M 05 Nov 2019 06:43:49.361 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
1:M 05 Nov 2019 06:43:49.361 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
1:M 05 Nov 2019 06:43:49.361 * DB loaded from disk: 0.000 seconds
1:M 05 Nov 2019 06:43:49.361 * Ready to accept connections
1:S 05 Nov 2019 06:43:59.817 * Before turning into a replica, using my master parameters to synthesize a cached master: I may be able to synchronize with the new master with just a partial transfer.
1:S 05 Nov 2019 06:43:59.817 * REPLICAOF 10.0.20.5:6379 enabled (user request from 'id=5 addr=10.0.20.7:60534 fd=10 name=sentinel-38a1e461-cmd age=10 idle=0 flags=x db=0 sub=0 psub=0 multi=3 qbuf=148 qbuf-free=32620 obl=36 oll=0 omem=0 events=r cmd=exec')
1:S 05 Nov 2019 06:43:59.817 # CONFIG REWRITE executed with success.
1:S 05 Nov 2019 06:44:00.386 * Connecting to MASTER 10.0.20.5:6379
1:S 05 Nov 2019 06:44:00.387 * MASTER <-> REPLICA sync started
1:S 05 Nov 2019 06:44:00.387 * Non blocking connect for SYNC fired the event.
1:S 05 Nov 2019 06:44:00.387 * Master replied to PING, replication can continue...
1:S 05 Nov 2019 06:44:00.387 * Trying a partial resynchronization (request 0b1ed09c8d497744632c93cab960c4ca4ee9a11e:1).
1:S 05 Nov 2019 06:44:00.388 * Full resync from master: f3c311652d8860c93048eba075521df7033cab2f:38645
1:S 05 Nov 2019 06:44:00.388 * Discarding previously cached master state.
1:S 05 Nov 2019 06:44:00.486 * MASTER <-> REPLICA sync: receiving 178 bytes from master
1:S 05 Nov 2019 06:44:00.486 * MASTER <-> REPLICA sync: Flushing old data
1:S 05 Nov 2019 06:44:00.486 * MASTER <-> REPLICA sync: Loading DB in memory
1:S 05 Nov 2019 06:44:00.486 * MASTER <-> REPLICA sync: Finished with success
1:S 05 Nov 2019 06:44:35.367 # Connection with master lost.
1:S 05 Nov 2019 06:44:35.367 * Caching the disconnected master state.
1:S 05 Nov 2019 06:44:35.464 * Connecting to MASTER 10.0.20.5:6379
1:S 05 Nov 2019 06:44:35.465 * MASTER <-> REPLICA sync started
1:S 05 Nov 2019 06:44:35.465 # Error condition on socket for SYNC: Connection refused
1:S 05 Nov 2019 06:44:36.466 * Connecting to MASTER 10.0.20.5:6379
1:S 05 Nov 2019 06:44:36.466 * MASTER <-> REPLICA sync started
1:M 05 Nov 2019 06:44:40.748 # Setting secondary replication ID to f3c311652d8860c93048eba075521df7033cab2f, valid up to offset: 46004. New replication ID is 77213f07383dd307e4b6d917b6a8789de42cad20
1:M 05 Nov 2019 06:44:40.748 * Discarding previously cached master state.
1:M 05 Nov 2019 06:44:40.748 * MASTER MODE enabled (user request from 'id=16 addr=10.0.20.7:60576 fd=17 name=sentinel-38a1e461-cmd age=31 idle=0 flags=x db=0 sub=0 psub=0 multi=3 qbuf=140 qbuf-free=32628 obl=36 oll=0 omem=0 events=r cmd=exec')
1:M 05 Nov 2019 06:44:40.748 # CONFIG REWRITE executed with success.
1:M 05 Nov 2019 06:44:41.881 * Replica redis-replica-2:6379 asks for synchronization
1:M 05 Nov 2019 06:44:41.881 * Partial resynchronization request from redis-replica-2:6379 accepted. Sending 881 bytes of backlog starting from offset 46004.
1:S 05 Nov 2019 06:44:43.132 # Connection with replica redis-replica-2:6379 lost.
1:S 05 Nov 2019 06:44:43.132 * Before turning into a replica, using my master parameters to synthesize a cached master: I may be able to synchronize with the new master with just a partial transfer.
1:S 05 Nov 2019 06:44:43.132 * REPLICAOF 10.0.20.2:6379 enabled (user request from 'id=24 addr=10.0.20.7:60636 fd=15 name=sentinel-38a1e461-cmd age=3 idle=0 flags=x db=0 sub=0 psub=0 multi=3 qbuf=291 qbuf-free=32477 obl=36 oll=0 omem=0 events=r cmd=exec')
1:S 05 Nov 2019 06:44:43.133 # CONFIG REWRITE executed with success.
1:S 05 Nov 2019 06:44:43.484 * Connecting to MASTER 10.0.20.2:6379
1:S 05 Nov 2019 06:44:43.484 * MASTER <-> REPLICA sync started
1:S 05 Nov 2019 06:44:43.484 * Non blocking connect for SYNC fired the event.
1:S 05 Nov 2019 06:44:43.484 * Master replied to PING, replication can continue...
1:S 05 Nov 2019 06:44:43.484 * Trying a partial resynchronization (request 77213f07383dd307e4b6d917b6a8789de42cad20:46885).
1:S 05 Nov 2019 06:44:43.484 * Master is currently unable to PSYNC but should be in the future: -NOMASTERLINK Can't SYNC while not connected with my master
1:S 05 Nov 2019 06:44:44.489 * Connecting to MASTER 10.0.20.2:6379
1:S 05 Nov 2019 06:44:44.489 * MASTER <-> REPLICA sync started
1:S 05 Nov 2019 06:44:44.489 * Non blocking connect for SYNC fired the event.
1:S 05 Nov 2019 06:44:44.489 * Master replied to PING, replication can continue...
1:S 05 Nov 2019 06:44:44.490 * Trying a partial resynchronization (request 77213f07383dd307e4b6d917b6a8789de42cad20:46885).
1:S 05 Nov 2019 06:44:44.490 * Master is currently unable to PSYNC but should be in the future: -NOMASTERLINK Can't SYNC while not connected with my master
1:S 05 Nov 2019 06:44:45.489 * Connecting to MASTER 10.0.20.2:6379
1:S 05 Nov 2019 06:44:45.490 * MASTER <-> REPLICA sync started
1:S 05 Nov 2019 06:44:45.490 * Non blocking connect for SYNC fired the event.
1:S 05 Nov 2019 06:44:45.490 * Master replied to PING, replication can continue...
1:S 05 Nov 2019 06:44:45.490 * Trying a partial resynchronization (request 77213f07383dd307e4b6d917b6a8789de42cad20:46885).
1:S 05 Nov 2019 06:44:45.490 * Master is currently unable to PSYNC but should be in the future: -NOMASTERLINK Can't SYNC while not connected with my master
1:S 05 Nov 2019 06:44:46.493 * Connecting to MASTER 10.0.20.2:6379
1:S 05 Nov 2019 06:44:46.493 * MASTER <-> REPLICA sync started
1:S 05 Nov 2019 06:44:46.493 * Non blocking connect for SYNC fired the event.
1:S 05 Nov 2019 06:44:46.493 * Master replied to PING, replication can continue...
1:S 05 Nov 2019 06:44:46.493 * Trying a partial resynchronization (request 77213f07383dd307e4b6d917b6a8789de42cad20:46885).
1:S 05 Nov 2019 06:44:46.494 * Master is currently unable to PSYNC but should be in the future: -NOMASTERLINK Can't SYNC while not connected with my master
1:S 05 Nov 2019 06:44:47.493 * Connecting to MASTER 10.0.20.2:6379
1:S 05 Nov 2019 06:44:47.494 * MASTER <-> REPLICA sync started
1:S 05 Nov 2019 06:44:47.494 * Non blocking connect for SYNC fired the event.
1:S 05 Nov 2019 06:44:47.494 * Master replied to PING, replication can continue...
1:S 05 Nov 2019 06:44:47.494 * Trying a partial resynchronization (request 77213f07383dd307e4b6d917b6a8789de42cad20:46885).

<-- a few entries with the same errors as above omitted for readability -->

1:S 05 Nov 2019 06:45:21.575 * Connecting to MASTER 10.0.20.2:6379
1:S 05 Nov 2019 06:45:21.575 * MASTER <-> REPLICA sync started
1:S 05 Nov 2019 06:45:21.575 * Non blocking connect for SYNC fired the event.
1:S 05 Nov 2019 06:45:21.575 * Master replied to PING, replication can continue...
1:S 05 Nov 2019 06:45:21.575 * Trying a partial resynchronization (request 77213f07383dd307e4b6d917b6a8789de42cad20:46885).
1:S 05 Nov 2019 06:45:21.575 * Master is currently unable to PSYNC but should be in the future: -NOMASTERLINK Can't SYNC while not connected with my master
1:S 05 Nov 2019 06:45:22.456 * REPLICAOF 10.0.20.10:6379 enabled (user request from 'id=113 addr=10.0.20.7:60950 fd=12 name=sentinel-38a1e461-cmd age=5 idle=0 flags=x db=0 sub=0 psub=0 multi=3 qbuf=150 qbuf-free=32618 obl=36 oll=0 omem=0 events=r cmd=exec')
1:S 05 Nov 2019 06:45:22.456 # CONFIG REWRITE executed with success.
1:S 05 Nov 2019 06:45:22.577 * Connecting to MASTER 10.0.20.10:6379
1:S 05 Nov 2019 06:45:22.577 * MASTER <-> REPLICA sync started
1:S 05 Nov 2019 06:45:22.577 * Non blocking connect for SYNC fired the event.
1:S 05 Nov 2019 06:45:22.577 * Master replied to PING, replication can continue...
1:S 05 Nov 2019 06:45:22.577 * Trying a partial resynchronization (request 77213f07383dd307e4b6d917b6a8789de42cad20:46885).
1:S 05 Nov 2019 06:45:22.577 * Successful partial resynchronization with master.
1:S 05 Nov 2019 06:45:22.577 # Master replication ID changed to 3235720aad34423d6f82f9db4a953042c1f16d58
1:S 05 Nov 2019 06:45:22.577 * MASTER <-> REPLICA sync: Master accepted a Partial Resynchronization.
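
The role changes are easy to lose among the repeated PSYNC retries. A minimal sketch that pulls only the `REPLICAOF ... enabled` and `MASTER MODE enabled` lines out of a log like the one above; the helper name `role_changes` is hypothetical, and the client details in the sample lines are shortened with `...` for space.

```python
import re

# Matches the two role-change messages that appear in the node log.
ROLE_CHANGE = re.compile(
    r"^\d+:[A-Z] (?P<ts>\d{2} \w{3} \d{4} [\d:.]+) \* "
    r"(?:REPLICAOF (?P<master>\S+) enabled|(?P<promoted>MASTER MODE enabled))"
)


def role_changes(log_lines):
    """Yield (timestamp, description) for every role change in a Redis log."""
    for line in log_lines:
        m = ROLE_CHANGE.match(line)
        if not m:
            continue
        if m.group("promoted"):
            yield m.group("ts"), "promoted to master"
        else:
            yield m.group("ts"), "replica of " + m.group("master")


# Lines taken from the log above (client info shortened).
sample = [
    "1:S 05 Nov 2019 06:43:59.817 * REPLICAOF 10.0.20.5:6379 enabled (user request from 'id=5 ...')",
    "1:S 05 Nov 2019 06:43:59.817 # CONFIG REWRITE executed with success.",
    "1:M 05 Nov 2019 06:44:40.748 * MASTER MODE enabled (user request from 'id=16 ...')",
    "1:S 05 Nov 2019 06:44:43.132 * REPLICAOF 10.0.20.2:6379 enabled (user request from 'id=24 ...')",
    "1:S 05 Nov 2019 06:45:22.456 * REPLICAOF 10.0.20.10:6379 enabled (user request from 'id=113 ...')",
]

for ts, what in role_changes(sample):
    print(ts, what)
```

Read this way, the node is promoted at 06:44:40.748 and told to replicate from 10.0.20.2 (its own address) less than three seconds later, at 06:44:43.132.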

sentinel log file (I have added extra line breaks where each failover starts)

root@3708cf05eca4:/data# cat sentinel.log
1:X 05 Nov 2019 06:40:49.116 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:X 05 Nov 2019 06:40:49.116 # Redis version=5.0.5, bits=64, commit=00000000, modified=0, pid=1, just started
1:X 05 Nov 2019 06:40:49.116 # Configuration loaded
1:X 05 Nov 2019 06:40:49.117 * Running mode=sentinel, port=26379.
1:X 05 Nov 2019 06:40:49.117 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
1:X 05 Nov 2019 06:40:49.119 # Sentinel ID is 38a1e461910e17fb7be79e695040074df2dde2df
1:X 05 Nov 2019 06:40:49.119 # +monitor master eaas-redis-master 10.0.20.2 6379 quorum 2
1:X 05 Nov 2019 06:40:49.120 * +slave slave redis-replica-1:6379 10.0.20.5 6379 @ eaas-redis-master 10.0.20.2 6379
1:X 05 Nov 2019 06:40:51.183 * +sentinel sentinel 3b0831ce9f6aff70f9bf45f4211d66ebfd1c6a21 10.0.20.33 26379 @ eaas-redis-master 10.0.20.2 6379
1:X 05 Nov 2019 06:40:59.150 * +slave slave redis-replica-2:6379 10.0.20.10 6379 @ eaas-redis-master 10.0.20.2 6379
1:X 05 Nov 2019 06:40:59.202 * +fix-slave-config slave redis-replica-1:6379 10.0.20.5 6379 @ eaas-redis-master 10.0.20.2 6379
1:X 05 Nov 2019 06:41:01.362 * +sentinel sentinel 464f3750404b419fccf513784f40baf7f6622cba 10.0.20.41 26379 @ eaas-redis-master 10.0.20.2 6379
1:X 05 Nov 2019 06:41:09.249 * +fix-slave-config slave redis-replica-2:6379 10.0.20.10 6379 @ eaas-redis-master 10.0.20.2 6379



1:X 05 Nov 2019 06:43:48.513 # +sdown master eaas-redis-master 10.0.20.2 6379
1:X 05 Nov 2019 06:43:48.594 # +new-epoch 1
1:X 05 Nov 2019 06:43:48.595 # +vote-for-leader 464f3750404b419fccf513784f40baf7f6622cba 1
1:X 05 Nov 2019 06:43:48.613 # +odown master eaas-redis-master 10.0.20.2 6379 #quorum 2/2
1:X 05 Nov 2019 06:43:48.613 # Next failover delay: I will not start a failover before Tue Nov  5 06:43:59 2019
1:X 05 Nov 2019 06:43:49.732 # +config-update-from sentinel 464f3750404b419fccf513784f40baf7f6622cba 10.0.20.41 26379 @ eaas-redis-master 10.0.20.2 6379
1:X 05 Nov 2019 06:43:49.732 # +switch-master eaas-redis-master 10.0.20.2 6379 10.0.20.5 6379
1:X 05 Nov 2019 06:43:49.732 * +slave slave 10.0.20.10:6379 10.0.20.10 6379 @ eaas-redis-master 10.0.20.5 6379
1:X 05 Nov 2019 06:43:49.732 * +slave slave 10.0.20.2:6379 10.0.20.2 6379 @ eaas-redis-master 10.0.20.5 6379
1:X 05 Nov 2019 06:43:49.785 * +slave slave redis-replica-2:6379 10.0.20.10 6379 @ eaas-redis-master 10.0.20.5 6379
1:X 05 Nov 2019 06:43:59.816 * +convert-to-slave slave 10.0.20.2:6379 10.0.20.2 6379 @ eaas-redis-master 10.0.20.5 6379
1:X 05 Nov 2019 06:44:09.832 * +slave slave redis-master:6379 10.0.20.2 6379 @ eaas-redis-master 10.0.20.5 6379



1:X 05 Nov 2019 06:44:40.453 # +sdown master eaas-redis-master 10.0.20.5 6379
1:X 05 Nov 2019 06:44:40.524 # +odown master eaas-redis-master 10.0.20.5 6379 #quorum 2/2
1:X 05 Nov 2019 06:44:40.524 # +new-epoch 2
1:X 05 Nov 2019 06:44:40.524 # +try-failover master eaas-redis-master 10.0.20.5 6379
1:X 05 Nov 2019 06:44:40.525 # +vote-for-leader 38a1e461910e17fb7be79e695040074df2dde2df 2
1:X 05 Nov 2019 06:44:40.525 # 3b0831ce9f6aff70f9bf45f4211d66ebfd1c6a21 voted for 3b0831ce9f6aff70f9bf45f4211d66ebfd1c6a21 2
1:X 05 Nov 2019 06:44:40.528 # 464f3750404b419fccf513784f40baf7f6622cba voted for 38a1e461910e17fb7be79e695040074df2dde2df 2
1:X 05 Nov 2019 06:44:40.580 # +elected-leader master eaas-redis-master 10.0.20.5 6379
1:X 05 Nov 2019 06:44:40.580 # +failover-state-select-slave master eaas-redis-master 10.0.20.5 6379
1:X 05 Nov 2019 06:44:40.681 # +selected-slave slave redis-master:6379 10.0.20.2 6379 @ eaas-redis-master 10.0.20.5 6379
1:X 05 Nov 2019 06:44:40.681 * +failover-state-send-slaveof-noone slave redis-master:6379 10.0.20.2 6379 @ eaas-redis-master 10.0.20.5 6379
1:X 05 Nov 2019 06:44:40.748 * +failover-state-wait-promotion slave redis-master:6379 10.0.20.2 6379 @ eaas-redis-master 10.0.20.5 6379
1:X 05 Nov 2019 06:44:41.003 # +promoted-slave slave redis-master:6379 10.0.20.2 6379 @ eaas-redis-master 10.0.20.5 6379
1:X 05 Nov 2019 06:44:41.003 # +failover-state-reconf-slaves master eaas-redis-master 10.0.20.5 6379
1:X 05 Nov 2019 06:44:41.101 * +slave-reconf-sent slave 10.0.20.10:6379 10.0.20.10 6379 @ eaas-redis-master 10.0.20.5 6379
1:X 05 Nov 2019 06:44:41.598 # -odown master eaas-redis-master 10.0.20.5 6379
1:X 05 Nov 2019 06:44:42.050 * +slave-reconf-inprog slave 10.0.20.10:6379 10.0.20.10 6379 @ eaas-redis-master 10.0.20.5 6379
1:X 05 Nov 2019 06:44:42.050 * +slave-reconf-done slave 10.0.20.10:6379 10.0.20.10 6379 @ eaas-redis-master 10.0.20.5 6379
1:X 05 Nov 2019 06:44:42.107 * +slave-reconf-sent slave redis-replica-2:6379 10.0.20.10 6379 @ eaas-redis-master 10.0.20.5 6379
1:X 05 Nov 2019 06:44:43.056 * +slave-reconf-inprog slave redis-replica-2:6379 10.0.20.10 6379 @ eaas-redis-master 10.0.20.5 6379
1:X 05 Nov 2019 06:44:43.056 * +slave-reconf-done slave redis-replica-2:6379 10.0.20.10 6379 @ eaas-redis-master 10.0.20.5 6379
1:X 05 Nov 2019 06:44:43.132 * +slave-reconf-sent slave 10.0.20.2:6379 10.0.20.2 6379 @ eaas-redis-master 10.0.20.5 6379
1:X 05 Nov 2019 06:44:44.111 * +slave-reconf-inprog slave 10.0.20.2:6379 10.0.20.2 6379 @ eaas-redis-master 10.0.20.5 6379
1:X 05 Nov 2019 06:44:46.056 # +failover-end-for-timeout master eaas-redis-master 10.0.20.5 6379
1:X 05 Nov 2019 06:44:46.056 # +failover-end master eaas-redis-master 10.0.20.5 6379
1:X 05 Nov 2019 06:44:46.056 * +slave-reconf-sent-be slave redis-master:6379 10.0.20.2 6379 @ eaas-redis-master 10.0.20.5 6379
1:X 05 Nov 2019 06:44:46.056 * +slave-reconf-sent-be slave 10.0.20.2:6379 10.0.20.2 6379 @ eaas-redis-master 10.0.20.5 6379
1:X 05 Nov 2019 06:44:46.056 # +switch-master eaas-redis-master 10.0.20.5 6379 10.0.20.2 6379
1:X 05 Nov 2019 06:44:46.057 * +slave slave 10.0.20.10:6379 10.0.20.10 6379 @ eaas-redis-master 10.0.20.2 6379
1:X 05 Nov 2019 06:44:46.057 * +slave slave 10.0.20.5:6379 10.0.20.5 6379 @ eaas-redis-master 10.0.20.2 6379
1:X 05 Nov 2019 06:44:51.062 # +sdown slave 10.0.20.5:6379 10.0.20.5 6379 @ eaas-redis-master 10.0.20.2 6379


1:X 05 Nov 2019 06:45:11.226 # +sdown master eaas-redis-master 10.0.20.2 6379
1:X 05 Nov 2019 06:45:16.233 # +new-epoch 3
1:X 05 Nov 2019 06:45:16.234 # +vote-for-leader 464f3750404b419fccf513784f40baf7f6622cba 3
1:X 05 Nov 2019 06:45:16.535 # +odown master eaas-redis-master 10.0.20.2 6379 #quorum 3/2
1:X 05 Nov 2019 06:45:16.535 # Next failover delay: I will not start a failover before Tue Nov  5 06:45:26 2019
1:X 05 Nov 2019 06:45:17.285 # +config-update-from sentinel 464f3750404b419fccf513784f40baf7f6622cba 10.0.20.41 26379 @ eaas-redis-master 10.0.20.2 6379
1:X 05 Nov 2019 06:45:17.285 # +switch-master eaas-redis-master 10.0.20.2 6379 10.0.20.10 6379
1:X 05 Nov 2019 06:45:17.285 * +slave slave 10.0.20.5:6379 10.0.20.5 6379 @ eaas-redis-master 10.0.20.10 6379
1:X 05 Nov 2019 06:45:17.285 * +slave slave 10.0.20.2:6379 10.0.20.2 6379 @ eaas-redis-master 10.0.20.10 6379
1:X 05 Nov 2019 06:45:22.456 * +fix-slave-config slave 10.0.20.2:6379 10.0.20.2 6379 @ eaas-redis-master 10.0.20.10 6379
1:X 05 Nov 2019 06:45:26.347 * +slave slave redis-replica-1:6379 10.0.20.5 6379 @ eaas-redis-master 10.0.20.10 6379
1:X 05 Nov 2019 06:45:26.348 * +slave slave redis-master:6379 10.0.20.2 6379 @ eaas-redis-master 10.0.20.10 6379
root@3708cf05eca4:/data#
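
In this log the same address shows up under two replica names, for example `10.0.20.2:6379` and `redis-master:6379`, both at 10.0.20.2. At 06:44:41.003 the entry `redis-master:6379` is promoted, and at 06:44:43.132 a reconfiguration is sent to the entry `10.0.20.2:6379`, the same instant the node log records `REPLICAOF 10.0.20.2:6379 enabled`. One possible reading, which the log alone does not prove, is that Sentinel is tracking the node twice, once by announced hostname and once by IP. A minimal sketch that lists such duplicates from `+slave` events; the helper name `names_by_address` is hypothetical.

```python
import re
from collections import defaultdict

# "+slave slave <name> <ip> <port> @ ..."; does not match "+slave-reconf-*" events.
SLAVE_EVENT = re.compile(r"\+slave slave (?P<name>\S+) (?P<ip>\S+) (?P<port>\d+) @")


def names_by_address(log_lines):
    """Map (ip, port) to the set of replica names Sentinel reported for it."""
    seen = defaultdict(set)
    for line in log_lines:
        m = SLAVE_EVENT.search(line)
        if m:
            seen[(m.group("ip"), int(m.group("port")))].add(m.group("name"))
    return seen


# Lines copied from the sentinel log above.
sample = [
    "1:X 05 Nov 2019 06:43:49.732 * +slave slave 10.0.20.10:6379 10.0.20.10 6379 @ eaas-redis-master 10.0.20.5 6379",
    "1:X 05 Nov 2019 06:43:49.732 * +slave slave 10.0.20.2:6379 10.0.20.2 6379 @ eaas-redis-master 10.0.20.5 6379",
    "1:X 05 Nov 2019 06:43:49.785 * +slave slave redis-replica-2:6379 10.0.20.10 6379 @ eaas-redis-master 10.0.20.5 6379",
    "1:X 05 Nov 2019 06:44:09.832 * +slave slave redis-master:6379 10.0.20.2 6379 @ eaas-redis-master 10.0.20.5 6379",
]

duplicates = {
    addr: sorted(names)
    for addr, names in names_by_address(sample).items()
    if len(names) > 1
}
for addr, names in duplicates.items():
    print(addr, names)
```

Both 10.0.20.2 and 10.0.20.10 come out with two names each in this sample, so the double entries are not specific to the master container.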

So I want to know why Sentinel rewrites the configuration file with replicaof for the master node only (it happens only for the master node, not for the replica nodes, when they are promoted to MASTER mode). How can I improve this setup so that the master node can run in MASTER mode again if Sentinel picks it during a failover?

Please let me know if any more information is required.

Below are the Redis configuration files for the master and replica nodes as they are when I start the Docker Swarm stack.

redis.conf(master)

dir /data/

replica-announce-ip {{REDIS_MASTER}}

save 900 1
save 300 10
save 60 10000
stop-writes-on-bgsave-error no
rdbchecksum yes

redis.conf(replica)

replicaof {{REDIS_MASTER}} 6379
dir /data/

replica-announce-ip {{REDIS_REPLICA}}

save 900 1
save 300 10
save 60 10000
stop-writes-on-bgsave-error no
rdbchecksum yes
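
How the `{{REDIS_MASTER}}` and `{{REDIS_REPLICA}}` placeholders get filled in is not shown here. As an experiment, one could render the templates so that `replica-announce-ip` receives a resolved IP address rather than a service hostname, and then check whether Sentinel still lists each node under two names. The sketch below is only an assumption about how such rendering might look: the hostnames `redis-master` and `redis-replica-1` are taken from the logs, the helper names are hypothetical, and a stand-in resolver replaces real DNS so the example runs anywhere. Whether this changes the failover behaviour is untested.

```python
import re
import socket


def render(template: str, values: dict) -> str:
    """Replace each {{NAME}} placeholder with values[NAME]."""
    return re.sub(r"\{\{(\w+)\}\}", lambda m: values[m.group(1)], template)


def announce_value(hostname: str, resolve=socket.gethostbyname) -> str:
    """Resolve a service hostname to an IPv4 address for replica-announce-ip."""
    return resolve(hostname)


# The two templated lines from the replica config above.
template = "replicaof {{REDIS_MASTER}} 6379\nreplica-announce-ip {{REDIS_REPLICA}}\n"

# Stand-in for DNS; the addresses are the ones listed in the question.
fake_dns = {"redis-master": "10.0.20.2", "redis-replica-1": "10.0.20.5"}

rendered = render(template, {
    "REDIS_MASTER": "redis-master",
    "REDIS_REPLICA": announce_value("redis-replica-1", resolve=fake_dns.__getitem__),
})
print(rendered)
```

Inside a container the default `socket.gethostbyname` resolver would be used instead of `fake_dns`, which assumes the service name resolves to the address the other nodes can reach.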
