PAUL MENA

Reputation: 117

Cassandra is not showing a node up hours after restart

I am in the process of doing a rolling restart on a 4-node cluster running Cassandra 2.1.9. I stopped and started Cassandra on node 1 via "service cassandra stop/start" and noted nothing unusual in either system.log or cassandra.log. Running "nodetool status" from node 1 shows all four nodes up:

user@node001=> nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address          Load       Tokens  Owns    Host ID                               Rack
UN  192.168.187.121  538.95 GB  256     ?       c99cf581-f4ae-4aa9-ab37-1a114ab2429b  rack1
UN  192.168.187.122  630.72 GB  256     ?       bfa07f47-7e37-42b4-9c0b-024b3c02e93f  rack1
UN  192.168.187.123  572.73 GB  256     ?       273df9f3-e496-4c65-a1f2-325ed288a992  rack1
UN  192.168.187.124  625.05 GB  256     ?       b8639cf1-5413-4ece-b882-2161bbb8a9c3  rack1

But running the same command from any of the other nodes still shows node 1 as down:

user@node002=> nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address          Load       Tokens  Owns    Host ID                               Rack
DN  192.168.187.121  538.94 GB  256     ?       c99cf581-f4ae-4aa9-ab37-1a114ab2429b  rack1
UN  192.168.187.122  630.72 GB  256     ?       bfa07f47-7e37-42b4-9c0b-024b3c02e93f  rack1
UN  192.168.187.123  572.73 GB  256     ?       273df9f3-e496-4c65-a1f2-325ed288a992  rack1
UN  192.168.187.124  625.04 GB  256     ?       b8639cf1-5413-4ece-b882-2161bbb8a9c3  rack1

"nodetool compactionstats" shows no pending tasks, and "nodetool netstats" shows nothing unusual. It's been over 12 hours and these inconsistencies persist. Another example is when I do a "nodetool gossipinfo" on the restarted node, which shows its status as normal:

user@node001=> nodetool gossipinfo
/192.168.187.121
  generation:1574364410
  heartbeat:209150
  NET_VERSION:8
  RACK:rack1
  STATUS:NORMAL,-104847506331695918
  RELEASE_VERSION:2.1.9
  SEVERITY:0.0
  LOAD:5.78684155614E11
  HOST_ID:c99cf581-f4ae-4aa9-ab37-1a114ab2429b
  SCHEMA:fd2dcb4b-ca62-30df-b8f2-d3fd774f2801
  DC:datacenter1
  RPC_ADDRESS:192.168.185.121

Versus another node, which shows node001's status as "shutdown":

user@node002=> nodetool gossipinfo
/192.168.187.121
  generation:1491825076
  heartbeat:2147483647
  STATUS:shutdown,true
  RACK:rack1
  NET_VERSION:8
  LOAD:5.78679987693E11
  RELEASE_VERSION:2.1.9
  DC:datacenter1
  SCHEMA:fd2dcb4b-ca62-30df-b8f2-d3fd774f2801
  HOST_ID:c99cf581-f4ae-4aa9-ab37-1a114ab2429b
  RPC_ADDRESS:192.168.185.121
  SEVERITY:0.0
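
For what it's worth, a quick way to compare every node's view of the restarted node in one pass is something like the sketch below (the hostnames and SSH access are assumptions; adjust for your environment):

#!/usr/bin/env bash
# Sketch: ask each node what gossip status it currently holds for the restarted node.
# node001..node004 are placeholder hostnames for the four cluster members.
TARGET=192.168.187.121
for h in node001 node002 node003 node004; do
    printf '%s sees %s as: ' "$h" "$TARGET"
    ssh "$h" nodetool gossipinfo \
        | awk -v ip="/$TARGET" '$1 == ip {found=1} found && /STATUS/ {print $1; exit}'
done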

Is there something I can do to remedy the current situation so that I can continue with the rolling restart?

Upvotes: 0

Views: 1413

Answers (1)

PAUL MENA

Reputation: 117

Here's what I ultimately ended up doing to get the "bad" node back into the cluster and to complete the rolling restart:

Perform a clean shutdown and restart

nodetool disablethrift
nodetool disablebinary
sleep 5
nodetool disablegossip
nodetool drain
sleep 10
/sbin/service cassandra restart
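
If you need to repeat this on every node of the cluster, it may be convenient to wrap the sequence in a small script. Here's a sketch (it assumes nodetool and the service script are on the PATH and that you have the required privileges):

#!/usr/bin/env bash
# clean-restart.sh -- sketch of the shutdown/restart sequence above
set -euo pipefail

nodetool disablethrift    # stop accepting Thrift client connections
nodetool disablebinary    # stop accepting native-protocol (CQL) connections
sleep 5
nodetool disablegossip    # announce shutdown to the rest of the ring
nodetool drain            # flush memtables and stop listening for writes
sleep 10
/sbin/service cassandra restart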

Monitor for return of the node

until echo "SELECT * FROM system.peers LIMIT 1;" | cqlsh `hostname` > /dev/null 2>&1; do echo "Node is still DOWN"; sleep 10; done && echo "Node is now UP"
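
The same check with a bounded wait, in case the node never comes back (the 60-attempt limit is an arbitrary assumption):

# Sketch: poll for up to ~10 minutes, then give up instead of looping forever.
attempts=0
until echo "SELECT * FROM system.peers LIMIT 1;" | cqlsh "$(hostname)" > /dev/null 2>&1; do
    attempts=$((attempts + 1))
    if [ "$attempts" -ge 60 ]; then
        echo "Node still DOWN after $attempts checks; giving up" >&2
        exit 1
    fi
    echo "Node is still DOWN"
    sleep 10
done
echo "Node is now UP"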

Remove the restarted node from the cluster

From a different node in the cluster, run the following command:

nodetool removenode <host-id>
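
The host ID is the UUID shown in the "Host ID" column of nodetool status (c99cf581-f4ae-4aa9-ab37-1a114ab2429b in the output above). A small sketch of looking it up and removing the node in one go (the awk field index assumes the Load column prints as two fields, e.g. "538.94 GB"):

# Run from a node that still sees 192.168.187.121 as DN.
HOST_ID=$(nodetool status | awk '$1 == "DN" && $2 == "192.168.187.121" {print $7}')
echo "Removing node with host ID $HOST_ID"
nodetool removenode "$HOST_ID"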

Perform a second clean shutdown and restart

nodetool disablethrift
nodetool disablebinary
sleep 5
nodetool disablegossip
nodetool drain
sleep 10
/sbin/service cassandra restart

Monitor for return of the node

until echo "SELECT * FROM system.peers LIMIT 1;" | cqlsh `hostname` > /dev/null 2>&1; do echo "Node is still DOWN"; sleep 10; done && echo "Node is now UP"

Confirm that the restarted node has rejoined the cluster

Tail the /var/log/cassandra/system.log file from one or more other nodes, looking for the following messages:

INFO  [HANDSHAKE-/192.168.187.124] 2019-12-12 19:17:33,654 OutboundTcpConnection.java:485 - Handshaking version with /192.168.187.124
INFO  [GossipStage:1] 2019-12-12 19:18:23,212 Gossiper.java:1019 - Node /192.168.187.124 is now part of the cluster
INFO  [SharedPool-Worker-1] 2019-12-12 19:18:23,213 Gossiper.java:984 - InetAddress /192.168.187.124 is now UP
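
For example, something along these lines (the grep pattern is just a convenience; any of the three messages above indicates the node is rejoining):

# Watch another node's log for the rejoin messages.
tail -f /var/log/cassandra/system.log \
    | grep -E 'Handshaking version|now part of the cluster|is now UP'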

Confirm that the expected number of nodes is in the cluster

The result of the following command should be identical across all nodes:

nodetool status
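
A quick way to verify this without eyeballing each node's output is to count the live (UN) rows as seen from every host (hostnames and SSH access are assumptions; every line should report 4 here):

# Sketch: count UP/Normal nodes as seen from each cluster member.
for h in node001 node002 node003 node004; do
    printf '%s: ' "$h"
    ssh "$h" nodetool status | grep -c '^UN'
done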

Upvotes: 2
