Raj

Reputation: 374

Issue with Cassandra while re-adding existing nodes

I have Cassandra 4.1.5 installed on my DC and DR servers. Below is the initial configuration I had:

On DC:

I have 3 nodes with IPs x.x.x.112, x.x.x.113, and x.x.x.114, where x.x.x.114 is the seed node and is part of the seed_provider list on all 3 nodes. All 3 nodes have the configuration below.

cassandra-rackdc.properties:

dc=dc1
rack=rack1

cassandra.yaml:

num_tokens: 16
seed_provider:
     - seeds: "x.x.x.114, x.x.y.114"
listen_address: x.x.x.112 (the corresponding node's IP address)
rpc_address: x.x.x.112 (the corresponding node's IP address)
endpoint_snitch: GossipingPropertyFileSnitch

On DR:

I have 3 nodes with IPs x.x.y.112, x.x.y.113, and x.x.y.114, where x.x.y.114 is the seed node and is part of the seed_provider list on all 3 nodes. All 3 nodes have the configuration below.

cassandra-rackdc.properties:

dc=dr1
rack=rack1

cassandra.yaml:

num_tokens: 16
seed_provider:
     - seeds: "x.x.x.114,x.x.y.114"
listen_address: x.x.y.112 (the corresponding node's IP address)
rpc_address: x.x.y.112 (the corresponding node's IP address)
endpoint_snitch: GossipingPropertyFileSnitch

Due to some network and infrastructure issues, I had to remove the DR nodes using nodetool removenode, and then I reinstalled and restarted the DR nodes. After that I found that the DR nodes were not connecting. I also removed the DC and DR seed nodes from the seed_provider properties of both data centres. I can still see the DR nodes in gossipinfo on DC: one of the nodes is in removed status, and a couple of nodes are in removed status with a REMOVAL_COORDINATOR entry, where that coordinator node is no longer in the cluster.
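
For reference, these are roughly the commands involved; the host ID is a placeholder and the grep is just to filter the DR addresses out of the gossip output:

nodetool removenode <host_id_of_dr_node>
nodetool removenode status
nodetool gossipinfo | grep -A 5 "x.x.y."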

My keyspace is currently defined as:

CREATE KEYSPACE mykeyspace WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': '3'} AND durable_writes = true;

and this way I am able to serve from DC. But I now want to set up my DR again and change the keyspace to:

ALTER KEYSPACE mykeyspace WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': '3', 'dr1': '3'} AND durable_writes = true;
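
Once DR is joined again, I plan to sanity-check the replication change with a query along these lines (system_schema.keyspaces is the standard schema table in 4.x):

SELECT keyspace_name, replication FROM system_schema.keyspaces WHERE keyspace_name = 'mykeyspace';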

I have tried restarting the nodes and cleaning the data folder on DR, but somehow the DR nodes are behaving abnormally. One of the nodes, x.x.y.114, is automatically connecting to the DC nodes and shows up in nodetool status even though there is no longer any cross-reference between DC and DR in the seed lists. And if I run nodetool status on a DR node, i.e. x.x.y.112, I can see the DC nodes in DN status, but x.x.y.114 is not part of the output.

Is there any way I can clean DR completely and add these nodes back as the DR data centre?
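
For context, by "cleaning the data folder" I mean roughly the following on each DR node (the paths are the defaults for data_file_directories, commitlog_directory, saved_caches_directory and hints_directory, and may differ on other installs):

sudo systemctl stop cassandra
rm -rf /var/lib/cassandra/data/* /var/lib/cassandra/commitlog/* /var/lib/cassandra/saved_caches/* /var/lib/cassandra/hints/*
sudo systemctl start cassandra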

Update after removing the nodes and waiting for 72 hours: the error below appears if I add x.x.y.112 to the seed provider of x.x.y.114 and x.x.y.114 starts syncing its schema with the DC nodes. Here is the error:

ERROR [main] 2024-09-24 10:31:22,915 DefaultSchemaUpdateHandler.java:142 - Didn't receive schemas for all known versions within the PT30S. Use -Dcassandra.skip_schema_check=true to skip this check.
ERROR [main] 2024-09-24 10:31:22,917 CassandraDaemon.java:900 - Exception encountered during startup
java.lang.IllegalStateException: Could not achieve schema readiness in PT30S
        at org.apache.cassandra.service.StorageService.waitForSchema(StorageService.java:1140)
        at org.apache.cassandra.dht.BootStrapper.allocateTokens(BootStrapper.java:234)
        at org.apache.cassandra.dht.BootStrapper.getBootstrapTokens(BootStrapper.java:180)
        at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:1192)
        at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:1145)
        at org.apache.cassandra.service.StorageService.initServer(StorageService.java:936)
        at org.apache.cassandra.service.StorageService.initServer(StorageService.java:854)
        at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:421)
        at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:744)
        at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:878)
INFO  [StorageServiceShutdownHook] 2024-09-24 10:31:22,919 HintsService.java:235 - Paused hints dispatch

Upvotes: 1

Views: 188

Answers (2)

Aaron

Reputation: 57798

It sounds like gossip is messed up. Try truncating the system.peers table on each node.

TRUNCATE system.peers;

While gossip is dynamic, it is also stored locally on each node. Running a removenode should have cleared out those old entries, but sometimes it doesn't. The new host_ids for those IPs won't match up with the older gossip state, so clearing it out is the best move.

And if system.peers is empty, Cassandra will rebuild it automatically.
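
For example, something like this in cqlsh on each node; the SELECT is optional and just confirms which stale entries are still there:

SELECT peer, host_id, data_center, rack FROM system.peers;
TRUNCATE system.peers;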

EDIT 20250124

Note that with Cassandra >= v4.0, you'll want to TRUNCATE the system.peers_v2 table.
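
That is, the 4.x equivalent of the statement above:

TRUNCATE system.peers_v2;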

Upvotes: 2

Erick Ramirez

Reputation: 16353

It's not clear to me why you removed the nodes, but we don't recommend doing that if you intend to keep them in the cluster for the long term. You should have resolved the underlying issue with the network instead of removing the nodes because, as you found out, doing so created another set of issues and didn't solve the original problem with the infrastructure.

When a node is removed from the cluster, it is placed into "quarantine" for 72 hours (3 days) to prevent the node from accidentally re-joining the cluster, for example, if an operator inadvertently restarts Cassandra.

A removed node will not be able to gossip with other nodes. It will stay in quarantine for 72 hours and cannot be added back to the cluster until the 3-day quarantine period has expired.

To override the 3-day quarantine period, you will need to set the system property very_long_time_ms on the command line with -Dcassandra.very_long_time_ms=<number_in_milliseconds> on ALL nodes, which will require a restart. You will then need to restart the nodes again withOUT this setting to reset it to the default value; a sketch of what that could look like follows below. Cheers!
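
For example, assuming a package install that reads conf/jvm-server.options, a temporary override could look like this (600000 ms = 10 minutes is just an illustration; pick a value that suits you, then remove the line and restart once more):

# added temporarily to conf/jvm-server.options on every node, removed again after the DR nodes are back
-Dcassandra.very_long_time_ms=600000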

Upvotes: 1

Related Questions