enzoyang

Reputation: 877

MongoDB: all replica set members stuck in STARTUP2

I have a MongoDB replica set with 2 nodes (node0, node1). One day, one of them (node1) crashed.

Because deleting all of node1's data and restarting it would take a long time, I shut down node0 and rsynced its data to node1.

After that, I started node0 and node1. Both members are stuck in STARTUP2. Below is some of the log:

Sat Feb  8 13:14:22.031 [rsMgr] replSet I don't see a primary and I can't elect myself
Sat Feb  8 13:14:24.888 [rsStart] replSet initial sync pending
Sat Feb  8 13:14:24.889 [rsStart] replSet initial sync need a member to be primary or secondary to do our initial sync

How to solve this problem?

Upvotes: 3

Views: 11490

Answers (2)

yaoxing

Reputation: 4193

EDIT 10/29/15: I found there's actually an easier way to get your primary back: use rs.reconfig with the option {force: true}. The MongoDB documentation describes this in detail. Use it with caution though: as the documentation mentions, it may cause a rollback.
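A minimal sketch of that forced reconfiguration in the mongo shell, assuming you are connected to the surviving member and that the host strings in rs.conf() match the member names reported by rs.status() exactly (for example "node0:27017"). The helper name healthyOnlyConfig is hypothetical; it only builds a config that keeps the members rs.status() reports as healthy, so inspect the result before passing it to rs.reconfig(cfg, {force: true}).

```javascript
// Hypothetical helper. Assumes `conf` is shaped like rs.conf() output and
// `status` like rs.status() output, and that conf host strings match status names.
// Keeps only configured members that rs.status() reports with health === 1.
function healthyOnlyConfig(conf, status) {
  var healthy = {};
  status.members.forEach(function (m) {
    if (m.health === 1) {
      healthy[m.name] = true;
    }
  });
  // Shallow-copy the top-level fields so the caller's object is left untouched.
  var copy = {};
  Object.keys(conf).forEach(function (k) {
    copy[k] = conf[k];
  });
  // Build a new members array containing only the healthy hosts.
  copy.members = conf.members.filter(function (m) {
    return healthy[m.host] === true;
  });
  return copy;
}

// Usage in the mongo shell (review the printed config before forcing it):
// var cfg = healthyOnlyConfig(rs.conf(), rs.status());
// printjson(cfg);
// rs.reconfig(cfg, {force: true});
```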

You should never build a 2-member replica set: once one member is down, the other can't tell whether its peer is down or it has itself been cut off from the network. With only 1 of 2 votes it cannot reach a majority, so it will not become primary. As a solution, add an arbiter node for voting.
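The arithmetic behind this: a member becomes primary only with votes from a strict majority of the voting members in the configuration, and members that are down still count toward that total. A small sketch in plain JavaScript (the helper names are hypothetical) shows why two members fail and an arbiter fixes it:

```javascript
// Votes needed for a strict majority of the configured voting members.
function majority(votingMembers) {
  return Math.floor(votingMembers / 2) + 1;
}

// Hypothetical helper: can the reachable voters (including the candidate itself)
// elect a primary?
function canElectPrimary(votingMembers, reachableVoters) {
  return reachableVoters >= majority(votingMembers);
}

// 2 data nodes, one down: 1 of 2 votes, but a majority of 2 is 2.
console.log(canElectPrimary(2, 1)); // false
// 2 data nodes + 1 arbiter, one data node down: 2 of 3 votes, majority of 3 is 2.
console.log(canElectPrimary(3, 2)); // true
```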

So your problem is this: when you restart node0 while node1 is already dead, no other node votes for it, and it can't tell whether it is still safe to act as primary. So it refuses to become primary, which is why you see the message

Sat Feb  8 13:14:24.889 [rsStart] replSet initial sync need a member to be primary or secondary to do our initial sync

As far as I know, there is no official way to resolve this other than rebuilding the replica set (though there is a trick further below). Follow these steps:

  1. Stop node0.
  2. Go to the data folder of node0 (on my machine it's /var/lib/mongodb; you can find yours via the dbpath setting in the config file at /etc/mongodb.conf).
  3. Delete local.* from the folder. Note that:
    1. This is not undoable, even if you backed up these files.
    2. You'll lose all the users stored in the local database.
  4. Start node0 and you should see it running as a standalone node.
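If you want to double-check what step 3 will delete before doing it, here is a dry-run sketch. It assumes the old MongoDB 2.4 key=value config format (dbpath=...) rather than the YAML format used by later versions, and falls back to mongod's default /data/db when no dbpath line is found. The function names are hypothetical, you supply the config text and directory listing yourself, and nothing is deleted.

```javascript
// Hypothetical helper: read dbpath from an INI-style (key=value) mongodb.conf.
// YAML configs (storage.dbPath) are NOT handled by this sketch.
function parseDbPath(confText) {
  var lines = confText.split("\n");
  for (var i = 0; i < lines.length; i++) {
    var line = lines[i].trim();
    // Skip blank lines and comments.
    if (line === "" || line.charAt(0) === "#") continue;
    var eq = line.indexOf("=");
    if (eq === -1) continue;
    if (line.slice(0, eq).trim() === "dbpath") {
      return line.slice(eq + 1).trim();
    }
  }
  return "/data/db"; // mongod's default when dbpath is not set
}

// Hypothetical helper: pick the files matching local.* from a directory listing.
function localDbFiles(fileNames) {
  return fileNames.filter(function (name) {
    return name.indexOf("local.") === 0;
  });
}
```

Removing the listed files is still a manual step, done only while mongod on node0 is stopped.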

Then follow the MongoDB manual to recreate the replica set:

  1. Run rs.initiate() to initialize the replica set.
  2. Add node1 to the replica set: rs.add("node1 domain name");
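rs.initiate() can also take an explicit configuration document instead of no arguments. A sketch of building one, assuming the set name matches the replSet value node0 was started with; "rs0" and "node0:27017" are placeholders, and buildInitiateConfig is a hypothetical helper:

```javascript
// Hypothetical helper: build the config document accepted by rs.initiate(cfg).
// Member _id values are assigned from the array index.
function buildInitiateConfig(setName, hosts) {
  return {
    _id: setName, // must equal the replSet name mongod was started with
    members: hosts.map(function (host, i) {
      return { _id: i, host: host };
    })
  };
}

// In the mongo shell on node0 (placeholder names):
// rs.initiate(buildInitiateConfig("rs0", ["node0:27017"]));
// ...wait until rs.status() shows node0 as PRIMARY, then:
// rs.add("node1:27017");
```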

I'm afraid you'll have to wait a long time for the initial sync to finish, and then you are good to go.

I strongly recommend adding an arbiter to avoid this situation again.

So, the above is the official way to resolve your issue. Below is how I did it with MongoDB 2.4.8. I didn't find any documentation backing it up, so there is absolutely NO guarantee; do it at your own risk. If it doesn't work for you, just fall back to the official way. Worth trying ;)

  1. Make sure no application tries to modify your database during the whole process; otherwise those modifications will not be synced to the secondary server.
  2. Restart your server without the replSet=[set name] parameter, so that it runs as a standalone and you can modify it.
  3. Go to the local database and delete node1 from db.system.replset. For example, on my machine it originally looked like:

    { "_id": "rs0", "version": 5, "members": [{ "_id": 0, "host": "node0" }, { "_id": 1, "host": "node1" }] }

You should change it to

{
    "_id": "rs0",
    "version": 5,
    "members": [{
        "_id": 0,
        "host": "node0"
    }]
}
  4. Restart with replSet=[set name], and you should see node0 become primary again.
  5. Add node1 to the replica set with the rs.add command.
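The edit of local.system.replset can be sketched in the mongo shell as follows, assuming mongod on node0 is running standalone (no replSet) and that the host string to remove matches the stored config exactly ("node1" in the example above). withoutMember is a hypothetical helper that returns a shallow copy of the document without that host:

```javascript
// Hypothetical helper: return a copy of a system.replset document minus one host.
// Top-level fields are copied as-is; only the members array is rebuilt.
function withoutMember(replsetDoc, host) {
  var copy = {};
  Object.keys(replsetDoc).forEach(function (k) {
    copy[k] = replsetDoc[k];
  });
  copy.members = replsetDoc.members.filter(function (m) {
    return m.host !== host;
  });
  return copy;
}

// In the mongo shell, with mongod started WITHOUT replSet:
// var local = db.getSiblingDB("local");
// var doc = local.system.replset.findOne();
// local.system.replset.update({ _id: doc._id }, withoutMember(doc, "node1"));
```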

That's all. Let me know if you have any questions.

Upvotes: 5

Constantin Guay

Reputation: 1664

I had the same issue when using MMS. I created a new replica set of 3 machines (2 data-bearing + 1 arbiter, which is tricky to set up on MMS, by the way), and all of them were stuck in STARTUP2 with "initial sync need a member to be primary or secondary to do our initial sync":

myReplicaSet:STARTUP2> rs.status()
{
        "set" : "myReplicaSet",
        "date" : ISODate("2015-01-17T21:20:12Z"),
        "myState" : 5,
        "members" : [
                {
                        "_id" : 0,
                        "name" : "server1.mydomain.com:27000",
                        "health" : 1,
                        "state" : 5,
                        "stateStr" : "STARTUP2",
                        "uptime" : 142,
                        "optime" : Timestamp(0, 0),
                        "optimeDate" : ISODate("1970-01-01T00:00:00Z"),
                        "lastHeartbeat" : ISODate("2015-01-17T21:20:12Z"),
                        "lastHeartbeatRecv" : ISODate("2015-01-17T21:20:11Z"),
                        "pingMs" : 0,
                        "lastHeartbeatMessage" : "initial sync need a member to be primary or secondary to do our initial sync"
                },
                {
                        "_id" : 1,
                        "name" : "server2.mydomain.com:27000",
                        "health" : 1,
                        "state" : 5,
                        "stateStr" : "STARTUP2",
                        "uptime" : 142,
                        "optime" : Timestamp(0, 0),
                        "optimeDate" : ISODate("1970-01-01T00:00:00Z"),
                        "infoMessage" : "initial sync need a member to be primary or secondary to do our initial sync",
                        "self" : true
                },
                {
                        "_id" : 3,
                        "name" : "server3.mydomain.com:27000",
                        "health" : 1,
                        "state" : 5,
                        "stateStr" : "STARTUP2",
                        "uptime" : 140,
                        "lastHeartbeat" : ISODate("2015-01-17T21:20:12Z"),
                        "lastHeartbeatRecv" : ISODate("2015-01-17T21:20:10Z"),
                        "pingMs" : 0
                }
        ],
        "ok" : 1
}
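That message means no member is in state 1 (PRIMARY) or 2 (SECONDARY), so nobody can serve as an initial sync source. A small sketch for checking this against rs.status() output; syncSources is a hypothetical helper name:

```javascript
// Replica set member state codes (5 is STARTUP2, 7 is ARBITER).
var PRIMARY = 1;
var SECONDARY = 2;

// Hypothetical helper: names of members that could act as an initial sync source.
function syncSources(status) {
  return status.members
    .filter(function (m) {
      return m.state === PRIMARY || m.state === SECONDARY;
    })
    .map(function (m) {
      return m.name;
    });
}

// In the mongo shell: syncSources(rs.status())
// For the output above, every member reports state 5 (STARTUP2), so this returns [].
```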

To fix it, I used yaoxing's answer. I had to shut down the replica set on MMS and wait for all members to stop. It took a while... Then, on all of them, I removed the contents of the data directory:

sudo rm -Rf /var/data/*

Only after that did I turn the replica set back on, and all was fine.

Upvotes: 0
