Reputation: 21
Tech info: Cassandra version 3.11.4; 2 datacenters, 54 nodes each, with 2 TB on disk per node; RF = 3 for all keyspaces.
Hello, everyone,
I need some help with a puzzling repair issue on a Cassandra cluster, which seems common but which I don't understand:
Each weekend, we repair one DC (alternating between the two) with a distributed "nodetool repair" command sent via SSH to each of the 54 nodes simultaneously (see the sketch below). We pass no options to "nodetool repair", which, if we understand correctly, translates to "-inc -par" (incremental, parallel) in Cassandra 3. We were advised not to use "-pr" with incremental repairs.
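For reference, our weekend job boils down to something like the following sketch (the hostnames, the SSH user and the simple loop are placeholders, not our real script):

    # Fire "nodetool repair" on every node of one DC at the same time.
    for i in $(seq -w 1 54); do
        ssh "cassandra@dc1-node${i}" "nodetool repair" &   # no options: incremental + parallel in 3.11
    done
    wait   # all 54 repairs run concurrently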
Repairs hang. "nodetool tpstats" shows 1 active and thousands of pending repair session threads. Nothing moves: no streams in "nodetool netstats", no validation compactions in "nodetool compactionstats".
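Concretely, we watch the cluster with:

    nodetool tpstats           # 1 active, thousands of pending repair threads
    nodetool netstats          # no streams in progress
    nodetool compactionstats   # no validation compactions running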
We checked that no previous repairs were running, and we restarted the whole cluster.
The error logs show only vague warnings like this:
[2021-12-20 23:07:14,358] Repair session 62e5fee1-6179-11ec-9687-81fe085aa34b for range [(3995347772210991689,4008580965241951449]] failed with error [repair #62e5fee1-6179-11ec-9687-81fe085aa34b on keyspace1/cf1, [(3995347772210991689,4008580965241951449]]] Validation failed in /xx.xx.xxx.52 (progress: 18%)
When investigating "xx.xx.xxx.52", we found nothing except, sometimes, a "Cannot start multiple repair sessions over the same sstables" message.
A rolling restart of the cluster frees the hanging threads.
A manual "nodetool repair" on a single node works fine.
Two "nodetool repair" commands on adjacent nodes hang with the same stuck threads.
We will eventually use the "Reaper" tool (http://cassandra-reaper.io/), but not anytime soon.
Our job works fine with Cassandra 2, so this might be an issue with incremental repairs.
We have another, smaller Cassandra 3 cluster (6 nodes per DC) that shows the same behavior, so we suspect we are not doing repairs right.
So, what are we missing here? Is it possible to repair every node in a DC simultaneously, as we do, or is our approach fundamentally wrong?
Does anyone manage to run repairs correctly this way?
Any help would be greatly appreciated, as we have no clue how to deal with this issue.
Note: we found many questions about this on StackOverflow (for example: "Simultaneous repairs cause repair to hang"). One answer points out that only one node can be repaired at a time, which seems confusing and very inconvenient for a large cluster. Another redirects to a bug, but it does not apply to our version (https://issues.apache.org/jira/browse/CASSANDRA-11824).
Can someone share their experience, or point us to the proper documentation page? That would be nice.
L.
PS: Excuse my English, it's not my native language.
Upvotes: 2
Views: 623
Reputation: 16313
The goal of a repair in Cassandra is to synchronise the data between replicas (nodes), so if you have a replication factor of 3 in a C* data centre, then running a repair on a node will also cause repair operations to run on the replica nodes.
If you run repairs on each of those replica nodes in parallel, you end up with multiple repair operations synchronising (fixing inconsistencies) over the same set of data. Those repairs all compete for the same resources on each node, which is why you run into issues like the "Cannot start multiple repair sessions over the same sstables" error you saw.
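You can see the overlap for yourself: nodetool getendpoints lists the nodes that own the replicas of a given partition key. For example (keyspace1/cf1 are taken from your log excerpt; "somekey" is just an arbitrary key value):

    # Show which nodes hold the replicas of one partition key.
    # With RF=3 this prints 3 IP addresses; repairs started on any
    # of those 3 nodes at the same time touch the same SSTables.
    nodetool getendpoints keyspace1 cf1 somekey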
We recommend that you perform a rolling repair instead: one node at a time, until all nodes are repaired. You only need to run a repair once every gc_grace_seconds, which by default is 10 days (864000 seconds), so we suggest scheduling repairs once a week.
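If you want to confirm the setting, gc_grace_seconds is stored per table in the system_schema.tables system keyspace (keyspace1 below is just the keyspace from your logs):

    # List gc_grace_seconds for every table in a keyspace.
    cqlsh -e "SELECT table_name, gc_grace_seconds FROM system_schema.tables WHERE keyspace_name = 'keyspace1';"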
For example, if you have a 6-node cluster, then schedule repairs on node 1 on Mondays, node 2 on Tuesdays, and so on. For larger clusters you have to do a bit more work and only run repairs in parallel on nodes which are not "adjacent" to each other in the ring, so they don't have overlapping token ranges. This is a bit difficult to reason about and therefore to manage, so we recommend that you use automated tools like Reaper.
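If you do script it yourself, a minimal sketch of the rolling (one-node-at-a-time) schedule might look like this, assuming SSH access from a control host and placeholder hostnames:

    #!/usr/bin/env bash
    # Rolling repair: repair one node at a time, only moving on when
    # the previous node has finished, so no two repair sessions
    # overlap on the same replicas. Hostnames/user are placeholders.
    NODES="node1 node2 node3 node4 node5 node6"
    for node in ${NODES}; do
        echo "Repairing ${node}..."
        ssh "cassandra@${node}" "nodetool repair" || {
            echo "Repair failed on ${node}; investigate before continuing." >&2
            exit 1
        }
    done

Cheers!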
Upvotes: 3