Reputation: 117
I am in the process of doing a rolling restart on a 4-node cluster running Cassandra 2.1.9. I stopped and started Cassandra on node 1 via "service cassandra stop/start", and noted nothing unusual in either system.log or cassandra.log. Running "nodetool status" from node 1 shows all four nodes up:
user@node001=> nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 192.168.187.121 538.95 GB 256 ? c99cf581-f4ae-4aa9-ab37-1a114ab2429b rack1
UN 192.168.187.122 630.72 GB 256 ? bfa07f47-7e37-42b4-9c0b-024b3c02e93f rack1
UN 192.168.187.123 572.73 GB 256 ? 273df9f3-e496-4c65-a1f2-325ed288a992 rack1
UN 192.168.187.124 625.05 GB 256 ? b8639cf1-5413-4ece-b882-2161bbb8a9c3 rack1
But running the same command from any of the other nodes shows node 1 as still down:
user@node002=> nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
DN 192.168.187.121 538.94 GB 256 ? c99cf581-f4ae-4aa9-ab37-1a114ab2429b rack1
UN 192.168.187.122 630.72 GB 256 ? bfa07f47-7e37-42b4-9c0b-024b3c02e93f rack1
UN 192.168.187.123 572.73 GB 256 ? 273df9f3-e496-4c65-a1f2-325ed288a992 rack1
UN 192.168.187.124 625.04 GB 256 ? b8639cf1-5413-4ece-b882-2161bbb8a9c3 rack1
"nodetool compactionstats" shows no pending tasks, and "nodetool netstats" shows nothing unusual. It's been over 12 hours and these inconsistencies persist. Another example is when I do a "nodetool gossipinfo" on the restarted node, which shows its status as normal:
user@node001=> nodetool gossipinfo
/192.168.187.121
generation:1574364410
heartbeat:209150
NET_VERSION:8
RACK:rack1
STATUS:NORMAL,-104847506331695918
RELEASE_VERSION:2.1.9
SEVERITY:0.0
LOAD:5.78684155614E11
HOST_ID:c99cf581-f4ae-4aa9-ab37-1a114ab2429b
SCHEMA:fd2dcb4b-ca62-30df-b8f2-d3fd774f2801
DC:datacenter1
RPC_ADDRESS:192.168.185.121
Versus another node, which still shows node001's status as "shutdown" (note the stale generation and the heartbeat pinned at 2147483647, i.e. Integer.MAX_VALUE):
user@node002=> nodetool gossipinfo
/192.168.187.121
generation:1491825076
heartbeat:2147483647
STATUS:shutdown,true
RACK:rack1
NET_VERSION:8
LOAD:5.78679987693E11
RELEASE_VERSION:2.1.9
DC:datacenter1
SCHEMA:fd2dcb4b-ca62-30df-b8f2-d3fd774f2801
HOST_ID:c99cf581-f4ae-4aa9-ab37-1a114ab2429b
RPC_ADDRESS:192.168.185.121
SEVERITY:0.0
Is there something I can do to remedy this situation so that I can continue with the rolling restart?
Upvotes: 0
Views: 1413
Reputation: 117
Here's what I ultimately did to get the "bad" node back into the cluster and complete the rolling restart:
Perform a clean shutdown and restart
nodetool disablethrift   # stop accepting Thrift (legacy RPC) client connections
nodetool disablebinary   # stop accepting native-protocol (CQL) client connections
sleep 5                  # let in-flight requests finish
nodetool disablegossip   # announce shutdown to the rest of the ring
nodetool drain           # flush memtables and stop accepting writes
sleep 10
/sbin/service cassandra restart
Monitor for the return of the node (the cqlsh query succeeds only once the native transport is back up):
until echo "SELECT * FROM system.peers LIMIT 1;" | cqlsh `hostname` > /dev/null 2>&1; do echo "Node is still DOWN"; sleep 10; done && echo "Node is now UP"
Remove the restarted node from the cluster
From a different node in the cluster, run the following command:
nodetool removenode <host-id>
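For example, with the values from the status output in the question, the Host ID can be read from a node that still sees 192.168.187.121 as down:
nodetool status | grep 192.168.187.121   # the UUID column is the Host ID
nodetool removenode c99cf581-f4ae-4aa9-ab37-1a114ab2429b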
Perform a second clean shutdown and restart
nodetool disablethrift
nodetool disablebinary
sleep 5
nodetool disablegossip
nodetool drain
sleep 10
/sbin/service cassandra restart
Monitor for the return of the node
until echo "SELECT * FROM system.peers LIMIT 1;" | cqlsh `hostname` > /dev/null 2>&1; do echo "Node is still DOWN"; sleep 10; done && echo "Node is now UP"
Confirm that the restarted node has rejoined the cluster
Tail the /var/log/cassandra/system.log file on one or more of the other nodes, looking for the following messages:
INFO [HANDSHAKE-/192.168.187.124] 2019-12-12 19:17:33,654 OutboundTcpConnection.java:485 - Handshaking version with /192.168.187.124
INFO [GossipStage:1] 2019-12-12 19:18:23,212 Gossiper.java:1019 - Node /192.168.187.124 is now part of the cluster
INFO [SharedPool-Worker-1] 2019-12-12 19:18:23,213 Gossiper.java:984 - InetAddress /192.168.187.124 is now UP
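A quick way to watch for those lines (a sketch using the log path above):
tail -f /var/log/cassandra/system.log | grep -E 'Handshaking version|now part of the cluster|is now UP'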
Confirm that the expected number of nodes is in the cluster
The result of the following command should be identical across all nodes:
nodetool status
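As a quick sanity check (a sketch; the pattern matches the Up/Normal rows shown earlier), every node should report the same count, four in this cluster:
nodetool status | grep -c '^UN'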
Upvotes: 2