Reputation: 21

Cassandra nodetool repair freezes entire cluster

I need help understanding what happens to Cassandra when we run nodetool repair on one of the column families in our keyspace.

We are running Cassandra 2.0.7 and have a table we use for indexing object data in our system.

CREATE TABLE ids_by_text (
  object_type text,
  field_name text,
  ref_type text,
  value text,
  ref_id timeuuid,
  PRIMARY KEY((object_type,field_name,ref_type),value,ref_id)
)

Rows can grow to be quite large. We have roughly 10 million objects in the database, each indexed on an average of 4-6 fields via the table above. That doesn't seem like a lot to me.

When we run nodetool repair, it runs for a while and then the following exception is thrown:

ERROR [AntiEntropySessions:8] 2014-07-06 16:47:48,863 RepairSession.java (line 286) [repair #5f37c2e0-052b-11e4-92f5-b9bfa38ef354] session completed with the following error
org.apache.cassandra.exceptions.RepairException: [repair #5f37c2e0-052b-11e4-92f5-b9bfa38ef354 on apps/ids_by_text, (-7683110849073497716,-7679039947314690170]] Sync failed between /10.0.2.166 and /10.0.2.163
    at org.apache.cassandra.repair.RepairSession.syncComplete(RepairSession.java:207)
    at org.apache.cassandra.service.ActiveRepairService.handleMessage(ActiveRepairService.java:236)
    at org.apache.cassandra.repair.RepairMessageVerbHandler.doVerb(RepairMessageVerbHandler.java:59)
    at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:60)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
 INFO [ScheduledTasks:1] 2014-07-06 16:47:48,909 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 66029 ms for 1 collections, 7898896176 used; max is 8547991552
 INFO [GossipTasks:1] 2014-07-06 16:47:48,901 Gossiper.java (line 883) InetAddress /10.0.2.162 is now DOWN
 INFO [GossipTasks:1] 2014-07-06 16:47:49,181 Gossiper.java (line 883) InetAddress /10.0.2.163 is now DOWN
 INFO [GossipTasks:1] 2014-07-06 16:47:49,184 StreamResultFuture.java (line 186) [Stream #da84b3e1-052b-11e4-92f5-b9bfa38ef354] Session with /10.0.2.163 is complete
 WARN [GossipTasks:1] 2014-07-06 16:47:49,186 StreamResultFuture.java (line 215) [Stream #da84b3e1-052b-11e4-92f5-b9bfa38ef354] Stream failed
 INFO [GossipTasks:1] 2014-07-06 16:47:49,187 Gossiper.java (line 883) InetAddress /10.0.2.165 is now DOWN
 INFO [GossipTasks:1] 2014-07-06 16:47:49,188 Gossiper.java (line 883) InetAddress /10.0.2.164 is now DOWN
 INFO [GossipTasks:1] 2014-07-06 16:47:49,189 Gossiper.java (line 883) InetAddress /10.0.2.166 is now DOWN
 INFO [GossipTasks:1] 2014-07-06 16:47:49,189 StreamResultFuture.java (line 186) [Stream #da84b3e0-052b-11e4-92f5-b9bfa38ef354] Session with /10.0.2.166 is complete
 WARN [GossipTasks:1] 2014-07-06 16:47:49,189 StreamResultFuture.java (line 215) [Stream #da84b3e0-052b-11e4-92f5-b9bfa38ef354] Stream failed

At this point, the other nodes start throwing TPStatus logs and become essentially unresponsive. The system does not recover from this. We are dead.

I went through and ran 'nodetool scrub' on all of the nodes. That worked on most of them, some failed, so I used 'sstablescrub' on them. We wrote a script that did a subrange repair and I can identify the ranges that are problematic, but I haven't done enough testing to know if that is consistent or symptomatic. Testing is tough when it brings production down, so I have to be cautious.
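
For reference, a subrange repair of the kind described above could be sketched roughly as follows. This is a minimal Python sketch, not the script actually used: the helper names split_range and repair_commands are hypothetical, the keyspace, table and token range are taken from the log above, and it assumes that nodetool repair on this version accepts the -st and -et (start and end token) options. It only prints the commands, so nothing touches the cluster until one is run by hand.

```python
def split_range(start, end, pieces):
    """Split the token range (start, end] into up to `pieces` contiguous subranges.

    Assumes start < end (no wrap-around range) and integer Murmur3-style tokens.
    """
    if pieces < 1:
        raise ValueError("pieces must be at least 1")
    if end <= start:
        raise ValueError("wrap-around or empty ranges are not handled in this sketch")
    width = end - start
    bounds = [start + (width * i) // pieces for i in range(pieces + 1)]
    # Drop empty subranges, which can appear when pieces > width.
    return [(lo, hi) for lo, hi in zip(bounds[:-1], bounds[1:]) if lo < hi]


def repair_commands(keyspace, table, start, end, pieces):
    """Build one nodetool repair command (as an argument list) per subrange."""
    return [
        ["nodetool", "repair", "-st", str(lo), "-et", str(hi), keyspace, table]
        for lo, hi in split_range(start, end, pieces)
    ]


if __name__ == "__main__":
    # Range reported in the failing repair session above.
    failing_start = -7683110849073497716
    failing_end = -7679039947314690170
    for cmd in repair_commands("apps", "ids_by_text", failing_start, failing_end, 8):
        # Print only; run each command manually while watching GC and gossip logs.
        print(" ".join(cmd))
```

Running one small subrange at a time makes it easier to see which slice triggers the long GC pause, and limits the damage if a node does go down.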

Sidebar question... how do you stop a repair that is underway? If I can see things going sideways, I'd like to stop it.

Note that every other column family in the keyspace repairs just fine.

I am not sure what other detail to give. We have been beating our heads against this for a week and, well, we're stuck.

Upvotes: 2

Answers (2)

Matthew O'Riordan

Reputation: 8211

You can stop a repair in 2.1.* as follows:

wget -q -O jmxterm.jar http://downloads.sourceforge.net/cyclops-group/jmxterm-1.0-alpha-4-uber.jar
java -jar ./jmxterm.jar
open localhost:7199 -u [optional username] -p [optional password]
bean org.apache.cassandra.db:type=StorageService
run forceTerminateAllRepairSessions

Upvotes: 1

yukim

Reputation: 641

This issue (https://issues.apache.org/jira/browse/CASSANDRA-7330) may be related to the unresponsiveness after the repair failure. It is fixed in 2.0.9, the latest version.

how do you stop a repair that is underway?

It is still a work in progress (https://issues.apache.org/jira/browse/CASSANDRA-3486).

Upvotes: 1
