Reputation: 51
Setup: replica set with 5 nodes, version 3.4.5.
I'm trying to switch the PRIMARY with rs.stepDown(60, 30), but I consistently get this error:
rs0:PRIMARY> rs.stepDown(60, 30)
{
"ok" : 0,
"errmsg" : "No electable secondaries caught up as of 2017-07-11T00:21:11.205+0000. Please use {force: true} to force node to step down.",
"code" : 50,
"codeName" : "ExceededTimeLimit"
}
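For context, my understanding is that the shell helper maps the two arguments onto the step-down period and the secondary catch-up window, i.e. it should correspond roughly to:
// My reading of the helper: stepDownSecs = 60, secondaryCatchUpPeriodSecs = 30
db.adminCommand({ replSetStepDown: 60, secondaryCatchUpPeriodSecs: 30 })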
However, rs.printSlaveReplicationInfo() running in a parallel terminal confirms that all replicas are fully caught up:
rs0:PRIMARY> rs.printSlaveReplicationInfo()
source: X.X.X.X:27017
syncedTo: Tue Jul 11 2017 00:21:11 GMT+0000 (UTC)
0 secs (0 hrs) behind the primary
source: X.X.X.X:27017
syncedTo: Tue Jul 11 2017 00:21:11 GMT+0000 (UTC)
0 secs (0 hrs) behind the primary
source: X.X.X.X:27017
syncedTo: Tue Jul 11 2017 00:21:11 GMT+0000 (UTC)
0 secs (0 hrs) behind the primary
source: X.X.X.X:27017
syncedTo: Tue Jul 11 2017 00:21:11 GMT+0000 (UTC)
0 secs (0 hrs) behind the primary
Am I doing something wrong?
UPD: As suggested below, I've checked long-running operations before and during rs.stepDown, and it looks like this:
# Before rs.stepDown
$ watch "mongo --quiet --eval 'JSON.stringify(db.currentOp())' | jq -r '.inprog[] | \"\(.secs_running) \(.desc) \(.op)\"' | sort -rnk1"
984287 rsSync none
984287 ReplBatcher none
67 WT RecordStoreThread: local.oplog.rs none
null SyncSourceFeedback none
null NoopWriter none
0 conn615153 command
0 conn614948 update
0 conn614748 getmore
...
# During rs.stepDown
984329 rsSync none
984329 ReplBatcher none
108 WT RecordStoreThread: local.oplog.rs none
16 conn615138 command
16 conn615136 command
16 conn615085 update
16 conn615079 insert
...
Basically, the long-running user operations seem to be a result of rs.stepDown() itself: secs_running becomes nonzero once the PRIMARY attempts to switch over and keeps growing all the way up until stepDown fails. Then everything gets back to normal.
Any ideas on why this happens and whether that's normal at all?
Upvotes: 5
Views: 11742
Reputation: 540
Quoting an answer from https://jira.mongodb.org/browse/SERVER-27015:
This is most likely due to the fact that by default the shutdown command will only succeed on a primary if the secondaries are fully caught up at the exact moment that the shutdown command is executed.
I faced a similar issue and tried the db.shutdownServer() command several times; it only worked when the secondary was exactly 0 seconds behind the primary.
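A sketch of what I mean (the timeoutSecs value is just an example; it gives the primary a window to wait for a secondary to catch up instead of requiring it to be caught up at that exact moment):
// Allow the primary up to 60 seconds for a secondary to catch up before shutting down
db.shutdownServer({ timeoutSecs: 60 })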
Upvotes: 0
Reputation: 2184
I have used the command below to step the primary down to secondary:
db.adminCommand( { replSetStepDown: 120, secondaryCatchUpPeriodSecs: 15, force: true } )
You can find this in the MongoDB official documentation:
https://docs.mongodb.com/manual/reference/command/replSetStepDown/
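If you'd rather avoid force: true (a forced step-down proceeds even when no electable secondary is caught up), one variation is to first try the same command with a longer catch-up window, for example:
// Sketch: try without force first, giving secondaries more time to catch up
db.adminCommand( { replSetStepDown: 120, secondaryCatchUpPeriodSecs: 60 } )
// Fall back to force: true only if this still fails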
Upvotes: 3
Reputation: 13775
To close the loop on this question, it was determined that the failed stepdown was due to time going backward on the host.
MongoDB 3.4.6 is more resilient to time issues on the host, and upgrading the deployment fixes the stalling issues.
Upvotes: 1
Reputation: 1580
Before stepping down, rs.stepDown() will attempt to terminate long-running user operations that would block the primary from stepping down, such as an index build, a write operation, or a map-reduce job.
Do you have any long-running jobs going on? Check the result of db.currentOp().
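For example, something along these lines surfaces them directly from the shell (the 10-second threshold is arbitrary):
// List in-progress operations that have been running for more than 10 seconds
db.currentOp(true).inprog
    .filter(function (op) { return op.secs_running > 10; })
    .forEach(function (op) {
        print(op.secs_running + "s  " + op.op + "  " + (op.ns || "") + "  " + op.desc);
    });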
You can also try setting a longer step-down time, e.g. rs.stepDown(360, 60) (the step-down period must be longer than the secondary catch-up period).
Upvotes: 0