cgspohn

Reputation: 1

MongoDB - Balancer locked after moveChunk aborted and metadata event not logged - how to recover

We had a problem with a secondary: a moveChunk failed to replicate to that secondary due to a Fatal Assertion error. The primary mongod reported the failure in its log (see below), but this information seems to have never made it back to the mongos/config servers, since those were restarting at the time. We have since restored the failed secondary and completely restarted the entire cluster, and things are healthy, except that the balancer will not run because a lock is still present, and I can see in the changelog collection that the failed moveChunk only has the moveChunk.start entry. I have reviewed the logs and I am fairly sure no migrations have run since we restarted the cluster. We are running MongoDB 2.6.1.

Logs from the primary mongod showing the moveChunk failure:

    2015-10-28T00:03:26.616-0400 [migrateThread] about to log metadata event: { _id: "shard10-2015-10-28T04:03:26-5630490e8d918836ed653d66", server: "shard10", clientAddr: ":27017", time: new Date(1446005006616), what: "moveChunk.to", ns: "prodAB.Instr_2015_10_26_IntervalRecord", details: { min: { appName: "AlertsAccumulator", ts: new Date(1445817600002) }, max: { appName: "CES_GI", ts: new Date(1445830500005) }, step 1 of 5: 1, step 2 of 5: 0, note: "aborted" } }
    ...
    2015-10-28T00:03:26.616-0400 [migrateThread] SyncClusterConnection connecting to [spider:43045]
    2015-10-28T00:03:26.617-0400 [migrateThread] warning: Failed to connect to 138.12.88.115:43045, reason: errno:111 Connection refused
    2015-10-28T00:03:26.617-0400 [migrateThread] SyncClusterConnection connect fail to: spider:43045 errmsg: couldn't connect to server spider:43045 (xxx), connection attempt failed
    ...
    2015-10-28T00:03:26.635-0400 [migrateThread] not logging config change: shard10-2015-10-28T04:03:26-5630490e8d918836ed653d66 can't authenticate to server spider:43045,spider2:43045,spider3:43045
    2015-10-28T00:03:26.635-0400 [migrateThread] ERROR: migrate failed: waitForReplication called but not master anymore
    2015-10-28T00:03:26.635-0400 [migrateThread] warning: no need to forget pending chunk [{ appName: "AlertsAccumulator", ts: new Date(1445817600002) },{ appName: "CES_GI", ts: new Date(1445830500005) }) because the local metadata for prodAB.Instr_2015_10_26_IntervalRecord has changed

Entries in the changelog collection:

mongos> db.changelog.find({ns: "prodAB.Instr_2015_10_26_IntervalRecord", what: /^moveChunk./, "details.min.appName": "AlertsAccumulator"})

{ "_id" : "shard09-2015-10-28T02:14:41-56302f917c97e85c20d69b55", 
"server" : "shard09", 
"clientAddr" : "xxx", 
"time" : ISODate("2015-10-28T02:14:41.265Z"), 
"what" : "moveChunk.start", 
"ns" : "prodAB.Instr_2015_10_26_IntervalRecord", 
"details" : { "min" : { "appName" : "AlertsAccumulator", "ts" : ISODate("2015-10-26T00:00:00.002Z") }, 
"max" : { "appName" : "CES_GI", "ts" : ISODate("2015-10-26T03:35:00.005Z") }, 
"from" : "rs10", "to" : "rs1" } }

Upvotes: 0

Views: 1483

Answers (1)

cgspohn

Reputation: 1

To answer my own question, here is what we did in our MongoDB 2.6.1 cluster:

  • Verified there were no migrations in progress. We did this by checking the mongos logs and the mongod logs of our shards, and also by checking the operations in progress (see the sketch after this list).
  • We made a backup of our cluster metadata before proceeding; instructions are at https://docs.mongodb.org/v2.6/tutorial/backup-sharded-cluster-metadata/ (a possible mongodump invocation is sketched after this list).
  • We deleted the lock by executing from a mongos:

    use config
    db.locks.remove({'_id': 'balancer'})
    
  • We were able to start the balancer again with sh.startBalancer()
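For completeness, this is roughly what the "no migrations in progress" check looked like. A minimal sketch, run from a mongos for the lock query and against each shard primary for currentOp; the 0 = unlocked / 2 = held convention is from the 2.6 docs:

    // from a mongos: any lock document with a non-zero state points at activity
    use config
    db.locks.find({ state: { $ne: 0 } })

    // on each shard primary: list operations in progress (including idle ones)
    // and scan the output for a moveChunk
    db.currentOp(true)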
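The metadata backup itself was just a mongodump of the config database, per the tutorial linked above. A minimal sketch, where the config server hostname, port and output path are placeholders:

    # dump the config database from one of the config servers
    mongodump --host cfg1.example.net --port 27019 --db config --out /backups/config-metadata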

Looks like the extra entry in the changelog did not matter at all. I should note that we chose to disable balancing for certain collections that we know we will be deleting soon and did not want to take on the extra burden of balancing at this time (sketched below); more info on how to disable balancing for specific collections is at: https://docs.mongodb.org/v2.6/tutorial/manage-sharded-cluster-balancer/
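For the per-collection part, a minimal sketch of what that looks like from a mongos; the collection name is a placeholder, sh.disableBalancing() is the helper described in the linked tutorial, and (as far as I know) it simply sets a noBalance flag on the collection's document in config.collections:

    // run from a mongos; the namespace is a placeholder
    sh.disableBalancing("prodAB.someCollectionWeWillDropSoon")

    // confirm which collections are currently excluded from balancing
    use config
    db.collections.find({ noBalance: true }, { _id: 1 })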

Upvotes: 0
