cgspohn

Reputation: 1

MongoDB - Balancer locked after moveChunk aborted and metadata event not logged - how to recover

We had a problem with a secondary: a moveChunk failed to replicate to that secondary due to a Fatal Assertion error. The primary mongod reported the failure in its log (see below), but this information seems to have never made it back to the mongos/config servers, since those were restarting at the time. We have since restored the failed secondary and completely restarted the entire cluster, and things are healthy, except that the balancer will not run because a lock is still present, and I can see in the changelog collection that the failed moveChunk only has the moveChunk.start entry. I have reviewed the logs and I am fairly sure no migrations have run since we restarted the cluster. We are running MongoDB 2.6.1.

Logs from the primary mongod showing the moveChunk failure:

    2015-10-28T00:03:26.616-0400 [migrateThread] about to log metadata event: { _id: "shard10-2015-10-28T04:03:26-5630490e8d918836ed653d66", server: "shard10", clientAddr: ":27017", time: new Date(1446005006616), what: "moveChunk.to", ns: "prodAB.Instr_2015_10_26_IntervalRecord", details: { min: { appName: "AlertsAccumulator", ts: new Date(1445817600002) }, max: { appName: "CES_GI", ts: new Date(1445830500005) }, step 1 of 5: 1, step 2 of 5: 0, note: "aborted" } }
    ...
    2015-10-28T00:03:26.616-0400 [migrateThread] SyncClusterConnection connecting to [spider:43045]
    2015-10-28T00:03:26.617-0400 [migrateThread] warning: Failed to connect to 138.12.88.115:43045, reason: errno:111 Connection refused
    2015-10-28T00:03:26.617-0400 [migrateThread] SyncClusterConnection connect fail to: spider:43045 errmsg: couldn't connect to server spider:43045 (xxx), connection attempt failed
    ...
    2015-10-28T00:03:26.635-0400 [migrateThread] not logging config change: shard10-2015-10-28T04:03:26-5630490e8d918836ed653d66 can't authenticate to server spider:43045,spider2:43045,spider3:43045
    2015-10-28T00:03:26.635-0400 [migrateThread] ERROR: migrate failed: waitForReplication called but not master anymore
    2015-10-28T00:03:26.635-0400 [migrateThread] warning: no need to forget pending chunk [{ appName: "AlertsAccumulator", ts: new Date(1445817600002) },{ appName: "CES_GI", ts: new Date(1445830500005) }) because the local metadata for prodAB.Instr_2015_10_26_IntervalRecord has changed

Entries in the changelog collection:

mongos> db.changelog.find({ns: "prodAB.Instr_2015_10_26_IntervalRecord", what: /^moveChunk./, "details.min.appName": "AlertsAccumulator"})

{ "_id" : "shard09-2015-10-28T02:14:41-56302f917c97e85c20d69b55", 
"server" : "shard09", 
"clientAddr" : "xxx", 
"time" : ISODate("2015-10-28T02:14:41.265Z"), 
"what" : "moveChunk.start", 
"ns" : "prodAB.Instr_2015_10_26_IntervalRecord", 
"details" : { "min" : { "appName" : "AlertsAccumulator", "ts" : ISODate("2015-10-26T00:00:00.002Z") }, 
"max" : { "appName" : "CES_GI", "ts" : ISODate("2015-10-26T03:35:00.005Z") }, 
"from" : "rs10", "to" : "rs1" } }

Upvotes: 0

Views: 1483

Answers (1)

cgspohn

Reputation: 1

To answer my own question, here is what we did in our MongoDB 2.6.1 cluster:

  • Verified there were no migrations in progress. We did this by checking the mongos logs and the mongod logs of our shards, and also by checking the operations in progress (see the sketch after this list).
  • We made a backup of our cluster metadata before proceeding; instructions are at https://docs.mongodb.org/v2.6/tutorial/backup-sharded-cluster-metadata/ (a possible mongodump invocation is sketched after this list).
  • We deleted the lock by executing from a mongos:

    use config
    db.locks.remove({'_id': 'balancer'})
    
  • We were able to start the balancer again with sh.startBalancer()
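For completeness, this is roughly what the "no migrations in progress" check looked like. A minimal sketch, run from a mongos for the lock query and against each shard primary for currentOp; the 0 = unlocked / 2 = held convention is from the 2.6 docs:

    // from a mongos: any lock document with a non-zero state points at activity
    use config
    db.locks.find({ state: { $ne: 0 } })

    // on each shard primary: list operations in progress (including idle ones)
    // and scan the output for a moveChunk
    db.currentOp(true)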
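The metadata backup itself was just a mongodump of the config database, per the tutorial linked above. A minimal sketch, where the config server hostname, port and output path are placeholders:

    # dump the config database from one of the config servers
    mongodump --host cfg1.example.net --port 27019 --db config --out /backups/config-metadata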

Looks like the extra entry in the changelog did not matter at all. I should note that we chose to disable balancing for certain collections that we know we will be deleting soon and did not want to take on the extra burden of balancing at this time (sketched below); more info on how to disable balancing for specific collections is at: https://docs.mongodb.org/v2.6/tutorial/manage-sharded-cluster-balancer/
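For the per-collection part, a minimal sketch of what that looks like from a mongos; the collection name is a placeholder, sh.disableBalancing() is the helper described in the linked tutorial, and (as far as I know) it simply sets a noBalance flag on the collection's document in config.collections:

    // run from a mongos; the namespace is a placeholder
    sh.disableBalancing("prodAB.someCollectionWeWillDropSoon")

    // confirm which collections are currently excluded from balancing
    use config
    db.collections.find({ noBalance: true }, { _id: 1 })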

Upvotes: 0
