Eemeli Kantola

Reputation: 5557

Cassandra nodes have different opinions about up/down status and replication. How to fix?

I've got a Cassandra 2.0.1 cluster of three nodes, and the main keyspace has replication factor 3. An extra fourth node had accidentally been misconfigured into the cluster; I first tried to fix that with an unnecessary "nodetool decommission" (on node db2) before doing the right thing and running "nodetool removenode ".

Now it seems that db2, the node where decommission was run, sees one of the other nodes as "Down", even though the others think everything is up. Additionally, when I run "nodetool ring" on each node, db1 reports "Replicas: 2" at the top of the listing, whereas db2 and db3 report "Replicas: 3".

The keyspace contains data I don't want to lose, and the cluster can't be taken completely down because new data is being inserted all the time. What would be a good way to fix the situation without endangering the existing and new data?

Obfuscated nodetool status outputs below.

[db1 ~]# nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens  Owns (effective)  Host ID                               Rack
UN  xx.xx.xx.99    30.38 MB   256     100.0%            cccccccc-cccc-cccc-cccc-cccccccccccc  rack1
UN  xx.xx.xx.122   28.93 MB   256     100.0%            aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa  rack1
UN  xx.xx.xx.123   29.59 MB   256     100.0%            bbbbbbbb-bbbb-bbbb-bbbb-bbbbbbbbbbbb  rack1

[db2 ~]# nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens  Owns (effective)  Host ID                               Rack
DN  xx.xx.xx.122   28.93 MB   256     100.0%            aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa  rack1
UN  xx.xx.xx.99    30.38 MB   256     100.0%            cccccccc-cccc-cccc-cccc-cccccccccccc  rack1
UN  xx.xx.xx.123   29.59 MB   256     100.0%            bbbbbbbb-bbbb-bbbb-bbbb-bbbbbbbbbbbb  rack1

[db3 ~]# nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens  Owns (effective)  Host ID                               Rack
UN  xx.xx.xx.122   28.93 MB   256     100.0%            aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa  rack1
UN  xx.xx.xx.99    30.38 MB   256     100.0%            cccccccc-cccc-cccc-cccc-cccccccccccc  rack1
UN  xx.xx.xx.123   29.59 MB   256     100.0%            bbbbbbbb-bbbb-bbbb-bbbb-bbbbbbbbbbbb  rack1

Upvotes: 4

Views: 7178

Answers (1)

psanford

Reputation: 5670

Aaron Morton described in detail how he debugged a similar problem. You should check on the state of gossip in your cluster.

  • Check the state of nodetool gossipinfo on each node (see the sketch after this list)
  • Enable the following trace logging:

    log4j.logger.org.apache.cassandra.gms.Gossiper=TRACE
    log4j.logger.org.apache.cassandra.gms.FailureDetector=TRACE
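
A minimal sketch of both steps, assuming a package-style install where the log4j settings live in /etc/cassandra/conf/log4j-server.properties (Cassandra 2.0 still uses log4j), and using the obfuscated hostnames db1/db2/db3 from the question in place of the real addresses:

    # Capture the gossip view from each node; the STATUS and generation
    # entries for every endpoint should agree across all three outputs.
    for h in db1 db2 db3; do
        nodetool -h "$h" gossipinfo > "gossip-$h.txt"
    done
    diff gossip-db1.txt gossip-db2.txt
    diff gossip-db1.txt gossip-db3.txt

    # Turn on gossip/failure-detector tracing; the config path below is an
    # assumption, and a node restart may be needed for the new levels to
    # take effect.
    LOG4J=/etc/cassandra/conf/log4j-server.properties
    echo 'log4j.logger.org.apache.cassandra.gms.Gossiper=TRACE'        >> "$LOG4J"
    echo 'log4j.logger.org.apache.cassandra.gms.FailureDetector=TRACE' >> "$LOG4J"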

Hopefully that gives you a better idea of what is going on in your cluster.

Upvotes: 4
