Reputation: 5557
I've got a Cassandra 2.0.1 cluster of three nodes, and the main keyspace has replication factor 3. After accidentally misconfiguring an extra fourth node into the cluster, I first tried to fix it with an unnecessary "nodetool decommission" (run on node db2) before doing the right thing and running "nodetool removenode".
Now db2, the node where decommission was run, sees one of the other nodes as Down, even though the others think everything is up. Additionally, when I run "nodetool ring" on all the nodes, db1 shows "Replicas: 2" at the top of the listing, while db2 and db3 show "Replicas: 3".
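Roughly how I compared that header on each node (db1/db2/db3 are just the shorthand names used above, and I'm running nodetool over ssh):
for h in db1 db2 db3; do
  echo "== $h =="
  ssh "$h" 'nodetool ring | grep Replicas'
done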
The keyspace contains data I don't want to lose, and the cluster can't be taken completely down because new data is being inserted all the time. What would be a good way to fix the situation without endangering the existing and new data?
Obfuscated nodetool status outputs below.
[db1 ~]# nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load      Tokens  Owns (effective)  Host ID                               Rack
UN  xx.xx.xx.99   30.38 MB  256     100.0%            cccccccc-cccc-cccc-cccc-cccccccccccc  rack1
UN  xx.xx.xx.122  28.93 MB  256     100.0%            aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa  rack1
UN  xx.xx.xx.123  29.59 MB  256     100.0%            bbbbbbbb-bbbb-bbbb-bbbb-bbbbbbbbbbbb  rack1
[db2 ~]# nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load      Tokens  Owns (effective)  Host ID                               Rack
DN  xx.xx.xx.122  28.93 MB  256     100.0%            aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa  rack1
UN  xx.xx.xx.99   30.38 MB  256     100.0%            cccccccc-cccc-cccc-cccc-cccccccccccc  rack1
UN  xx.xx.xx.123  29.59 MB  256     100.0%            bbbbbbbb-bbbb-bbbb-bbbb-bbbbbbbbbbbb  rack1
[db3 ~]# nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load      Tokens  Owns (effective)  Host ID                               Rack
UN  xx.xx.xx.122  28.93 MB  256     100.0%            aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa  rack1
UN  xx.xx.xx.99   30.38 MB  256     100.0%            cccccccc-cccc-cccc-cccc-cccccccccccc  rack1
UN  xx.xx.xx.123  29.59 MB  256     100.0%            bbbbbbbb-bbbb-bbbb-bbbb-bbbbbbbbbbbb  rack1
Upvotes: 4
Views: 7178
Reputation: 5670
Aaron Morton described in detail how he debugged a similar problem. You should check on the state of gossip in your cluster.
nodetool gossipinfo
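To see whether the three nodes actually agree, one option is to compare the gossip state each of them reports, along these lines (the IPs are the obfuscated ones from your status output):
for h in xx.xx.xx.99 xx.xx.xx.122 xx.xx.xx.123; do
  echo "== $h =="
  nodetool -h "$h" gossipinfo | grep -E '^/|STATUS'
done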
Enable the following TRACE logging:
log4j.logger.org.apache.cassandra.gms.Gossiper=TRACE
log4j.logger.org.apache.cassandra.gms.FailureDetector=TRACE
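On a 2.0.x install those two lines go into the log4j configuration (typically conf/log4j-server.properties; the paths below are the packaged defaults and may differ on your setup). Then watch the gossip and failure-detector chatter in the system log:
tail -f /var/log/cassandra/system.log | grep -E 'Gossiper|FailureDetector'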
Hopefully that will give you a better idea of what is going on in your cluster.
Upvotes: 4