Reputation: 21
Tech info: Cassandra version 3.11.4; 2 datacenters, 54 nodes each, with 2 TB on disk per node; RF = 3 for all keyspaces.
Hello, everyone,
I need some help with a puzzling repair issue on a Cassandra cluster, which seems common but which I don't understand:
Each weekend, we repair one DC (alternating between the two) with a distributed "nodetool repair" command sent via SSH to each of the 54 nodes simultaneously (see the sketch below). We pass no options to "nodetool repair", which, if we understand correctly, translates to "-inc -par" (incremental, parallel) in Cassandra 3. We were advised not to use "-pr" with incremental repairs.
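For reference, our weekend job boils down to something like the following sketch (the hostnames, the SSH user and the simple loop are placeholders, not our real script):

    # Fire "nodetool repair" on every node of one DC at the same time.
    for i in $(seq -w 1 54); do
        ssh "cassandra@dc1-node${i}" "nodetool repair" &   # no options: incremental + parallel in 3.11
    done
    wait   # all 54 repairs run concurrently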
Repairs hang. "nodetool tpstats" shows 1 active and thousands of pending repair session threads. Nothing moves: no streams in "nodetool netstats", no validation compactions in "nodetool compactionstats".
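Concretely, we watch the cluster with:

    nodetool tpstats           # 1 active, thousands of pending repair threads
    nodetool netstats          # no streams in progress
    nodetool compactionstats   # no validation compactions running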
We checked that no previous repairs were running, and we restarted the whole cluster.
The error logs show only vague warnings like this:
[2021-12-20 23:07:14,358] Repair session 62e5fee1-6179-11ec-9687-81fe085aa34b for range [(3995347772210991689,4008580965241951449]] failed with error [repair #62e5fee1-6179-11ec-9687-81fe085aa34b on keyspace1/cf1, [(3995347772210991689,4008580965241951449]]] Validation failed in /xx.xx.xxx.52 (progress: 18%)
When investigating "xx.xx.xxx.52", we found nothing except, sometimes, a "Cannot start multiple repair sessions over the same sstables" message.
A rolling restart of the cluster frees the hanging threads.
A manual "nodetool repair" on a single node works fine.
Two "nodetool repair" commands on adjacent nodes hang with the same stuck threads.
We will eventually use the "Reaper" tool (http://cassandra-reaper.io/), but not anytime soon.
Our job works fine with Cassandra 2, so this might be an issue with incremental repairs.
We have another, smaller Cassandra 3 cluster (6 nodes per DC) that shows the same behavior, so we suspect we are not doing repairs right.
So, what are we missing here? Is it possible to repair every node in a DC simultaneously, as we do, or is our approach fundamentally wrong?
Does anyone manage to run repairs correctly this way?
Any help would be greatly appreciated, as we have no clue how to deal with this issue.
Note: we found many questions about this on StackOverflow (for example: "Simultaneous repairs cause repair to hang"). One answer points out that only one node can be repaired at a time, which seems confusing and very inconvenient for a large cluster. Another redirects to a bug, but it does not apply to our version (https://issues.apache.org/jira/browse/CASSANDRA-11824).
Can someone share their experience, or point us to the proper documentation page? That would be nice.
L.
PS: Excuse my English, it's not my native language.
Upvotes: 2
Views: 623
Reputation: 16313
The goal of a repair in Cassandra is to synchronise the data between replicas (nodes), so if you have a replication factor of 3 in a C* data centre, then running a repair on a node will also cause repair operations to run on the replica nodes.
If you run repairs on each of those replica nodes in parallel, you end up with multiple repair operations synchronising (fixing inconsistencies) over the same set of data. Those repairs all compete for the same resources on each node, which is why you run into issues like the "Cannot start multiple repair sessions over the same sstables" error you saw.
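You can see the overlap for yourself: nodetool getendpoints lists the nodes that own the replicas of a given partition key. For example (keyspace1/cf1 are taken from your log excerpt; "somekey" is just an arbitrary key value):

    # Show which nodes hold the replicas of one partition key.
    # With RF=3 this prints 3 IP addresses; repairs started on any
    # of those 3 nodes at the same time touch the same SSTables.
    nodetool getendpoints keyspace1 cf1 somekey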
We recommend that you perform a rolling repair instead: one node at a time, until all nodes are repaired. You only need to run a repair once every gc_grace_seconds, which by default is 10 days (864000 seconds), so we suggest scheduling repairs once a week.
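If you want to confirm the setting, gc_grace_seconds is stored per table in the system_schema.tables system keyspace (keyspace1 below is just the keyspace from your logs):

    # List gc_grace_seconds for every table in a keyspace.
    cqlsh -e "SELECT table_name, gc_grace_seconds FROM system_schema.tables WHERE keyspace_name = 'keyspace1';"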
For example, if you have a 6-node cluster, then schedule repairs on node 1 on Mondays, node 2 on Tuesdays, and so on. For larger clusters you have to do a bit more work and only run repairs in parallel on nodes which are not "adjacent" to each other in the ring, so they don't have overlapping token ranges. This is a bit difficult to reason about and therefore to manage, so we recommend that you use automated tools like Reaper.
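If you do script it yourself, a minimal sketch of the rolling (one-node-at-a-time) schedule might look like this, assuming SSH access from a control host and placeholder hostnames:

    #!/usr/bin/env bash
    # Rolling repair: repair one node at a time, only moving on when
    # the previous node has finished, so no two repair sessions
    # overlap on the same replicas. Hostnames/user are placeholders.
    NODES="node1 node2 node3 node4 node5 node6"
    for node in ${NODES}; do
        echo "Repairing ${node}..."
        ssh "cassandra@${node}" "nodetool repair" || {
            echo "Repair failed on ${node}; investigate before continuing." >&2
            exit 1
        }
    done

Cheers!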
Upvotes: 3