Reputation: 466
I have a Cassandra cluster running version 2.0.9. nodetool repair has never been run on it since the start (nobody asked for these repairs to be scheduled). Each node holds around 8GB of data, which seems rather small to me. When I run nodetool repair it seems to take forever (it had not finished after 2 days).
I don't see any progress. I've read threads that tell you to check compactionstats and netstats, but those show no traffic, and yet the nodetool repair command never exits. That doesn't seem normal to me. I got messages saying the system keyspace was repaired and OK, but for the actual data we put in the cluster nothing is returned. All nodes are up. I've checked system.log (CentOS 6, BTW) for errors and there aren't any. I started a command that checks whether the number of commands and responses is still going up (it is), but I wonder whether that is directly linked to the nodetool repair or comes from something else. There doesn't seem to be any IO or network saturation.
So yesterday I started the repair again with a tool, range_repair.py. In the last 12 hours there has been no further output. The last output was:
INFO 2015-11-01 20:55:46,268 repair line: 296 : [1/256] repairing range (-09214247901397780884, -09166106147119295777) in 100 steps for keyspace <all>
The main issue with this repair taking forever (or just being hung) is that we want to upgrade Cassandra for an app deployment. The procedure says to do a nodetool repair first. Is this actually necessary before you start the upgrade? Maybe nodetool repair works more efficiently in the newer version (which now also has an incremental option).
Who can help me here? Thanks a lot in advance!
Upvotes: 0
Views: 661
Reputation: 466
I'm not sure this fully resolved the issue, but after a rolling restart of the whole cluster nodetool repair was able to finish where it hadn't before. For another keyspace I had to start the process over and over again to make any progress. I used range_repair.py, which let me skip to a certain token so I could slowly work my way up. In the end I ran it with the dry-run and steps options (1 step) and redirected the output to a file. Then I filtered the first column with sed and executed that file. If a command seems to hang, you can note it down, CTRL-C it, and rerun it afterwards. Generally it succeeded the second or third time I ran it.
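The "note it down, CTRL-C it and rerun" part can be automated. Below is a rough sketch of that idea, not something I actually ran: it assumes the filtered file holds one complete repair command per line, and the names `run_with_retries` and `run_command_file`, the timeout and the attempt count are all made up for illustration. It relies only on Python's standard `subprocess.run`, which kills the child process and raises `TimeoutExpired` when the timeout passes.

```python
import shlex
import subprocess


def run_with_retries(command, timeout_s, max_attempts=3, run=subprocess.run):
    """Run one command line; return the attempt number that succeeded, or None.

    A timeout stands in for the manual CTRL-C: subprocess.run kills the
    child once timeout_s seconds pass and raises TimeoutExpired.
    """
    argv = shlex.split(command)
    for attempt in range(1, max_attempts + 1):
        try:
            result = run(argv, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            continue  # looked hung: give up on this attempt and retry
        if result.returncode == 0:
            return attempt
    return None


def run_command_file(path, timeout_s, max_attempts=3, run=subprocess.run):
    """Run every non-empty line of `path` as a command; return the failures."""
    failed = []
    with open(path) as fh:
        for line in fh:
            command = line.strip()
            if not command or command.startswith("#"):
                continue  # skip blank lines and comments
            if run_with_retries(command, timeout_s, max_attempts, run) is None:
                failed.append(command)
    return failed
```

Something like `run_command_file("repair_commands.txt", timeout_s=1800)` (file name and timeout are placeholders) would then return the commands that never succeeded, so those ranges can be looked at by hand. Pick the timeout generously, since a slow but healthy repair step would otherwise be killed as well.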
Upvotes: 1