Chaity

Reputation: 1388

How do failure detection and recovery work in Cassandra?

To all Cassandra experts,

I am trying to understand Cassandra's failure detection and recovery. I am a little confused about how exactly this works.

From Datastax Doc:

Configuring the phi_convict_threshold property adjusts the sensitivity of the failure detector. Lower values increase the likelihood that an unresponsive node will be marked as down, while higher values decrease the likelihood that transient failures will cause a node to be marked as down. In unstable network environments (such as EC2 at times), raising the value to 10 or 12 helps prevent false failures.

From http://ljungblad.nu/post/44006928392/cassandra-and-its-accrual-failure-detector

Phi represents the likelihood that Node A is wrong about Node B’s state. The higher the Phi, the bigger the confidence that Node B has failed.

Can someone explain in detail how C*'s failure detection mechanism works and how C* recovers in different scenarios?

Thanks in advance

Chaity

Upvotes: 4

Views: 2198

Answers (1)

Myles Baker

Reputation: 3760

I don't consider myself a Cassandra expert, but here is my take on Cassandra's node failure detection:

  1. Once per second, each node contacts 1-3 other nodes and exchanges information about node state and location. These time-stamped messages are part of the Gossip protocol.
  2. The Snitch tells Cassandra which rack and data center each node belongs to, so that requests can be routed efficiently and replicas spread across the topology. A dynamic snitch can also detect that a node is performing poorly and route read requests away from it until it is functioning properly again.
  3. Hinted Handoff is a recovery mechanism for writes that target offline nodes. The Coordinator tracks whether each node on the write path acknowledges the write operation and, for a node that does not, stores a hint in the system.hints table. The write is replayed when the target node comes back online.
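
To make the link between the gossip heartbeats in item 1 and the phi_convict_threshold setting from the question more concrete, here is a rough sketch of an accrual failure detector. It is not Cassandra's actual code: the class and method names are made up for illustration, and it assumes heartbeat gaps follow an exponential distribution, so phi is simply the elapsed time divided by the mean gap, scaled by 1/ln(10).

```python
import math
from collections import deque


class AccrualFailureDetector:
    """Toy phi accrual detector (hypothetical names, not Cassandra's classes)."""

    def __init__(self, threshold=8.0, window=1000):
        self.threshold = threshold             # plays the role of phi_convict_threshold
        self.intervals = deque(maxlen=window)  # recent gaps between heartbeats, in seconds
        self.last_heartbeat = None

    def heartbeat(self, now):
        """Record a heartbeat received at time `now` (seconds)."""
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now):
        """Suspicion level: -log10 of P(gap > elapsed) under the exponential assumption."""
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        elapsed = now - self.last_heartbeat
        return elapsed / (mean * math.log(10))

    def is_down(self, now):
        return self.phi(now) > self.threshold


detector = AccrualFailureDetector(threshold=8.0)
for t in range(10):                 # heartbeats arrive once per second at t = 0..9
    detector.heartbeat(float(t))

print(detector.is_down(12.0))       # 3 s of silence: phi is about 1.3
print(detector.is_down(30.0))       # 21 s of silence: phi is about 9.1
```

With a threshold of 8, three seconds of silence is not enough to convict the node, while 21 seconds is. Raising the threshold to 12 would mean the same 21 seconds of silence no longer marks the node as down, which is the "less sensitive" behaviour the docs describe for flaky networks.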

All of these mechanisms work together when nodes go offline or perform poorly, and each can be configured. As far as I know, Cassandra will not bring a node back to life after a failure; a human has to bring the node back online and then run nodetool repair to repair the data on the failed node.
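
The hinted handoff idea from item 3 can be sketched roughly like this. It is a toy in-memory model, not how Cassandra stores or replays hints: the Coordinator class, its method names and the dict-based "nodes" are all made up for illustration.

```python
from collections import defaultdict


class Coordinator:
    """Toy coordinator that keeps hints for replicas that are down."""

    def __init__(self, replicas):
        self.replicas = replicas                  # node name -> dict acting as its storage
        self.up = {name: True for name in replicas}
        self.hints = defaultdict(list)            # node name -> pending (key, value) writes

    def write(self, key, value):
        """Write to every live replica, store a hint for each dead one, return ack count."""
        acks = 0
        for name, store in self.replicas.items():
            if self.up[name]:
                store[key] = value
                acks += 1
            else:
                self.hints[name].append((key, value))
        return acks

    def node_recovered(self, name):
        """Mark the node as up and replay the writes it missed."""
        self.up[name] = True
        for key, value in self.hints.pop(name, []):
            self.replicas[name][key] = value


coord = Coordinator({"a": {}, "b": {}, "c": {}})
coord.up["c"] = False               # pretend the failure detector convicted node c
acks = coord.write("k1", "v1")      # only a and b acknowledge
coord.node_recovered("c")           # c comes back and receives the hinted write
print(acks, coord.replicas["c"])
```

In a real cluster hints are only kept for a limited time, which is one reason a node that was down for long still needs a repair.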

Depending on your organization's failure tolerance for read and write operations, you can always configure the consistency level.
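
As a quick illustration of that trade-off, the arithmetic below shows how many replicas can be down before a request at a given consistency level fails. The function names are mine, and only three levels (ONE, QUORUM, ALL) are covered; a quorum is a majority of the replicas, i.e. floor(RF / 2) + 1.

```python
def quorum(replication_factor):
    """Majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def tolerated_failures(replication_factor, consistency_level):
    """Replicas that may be down while a request at this level can still succeed."""
    required = {
        "ONE": 1,
        "QUORUM": quorum(replication_factor),
        "ALL": replication_factor,
    }[consistency_level]
    return replication_factor - required


for level in ("ONE", "QUORUM", "ALL"):
    print(level, tolerated_failures(3, level))
```

With a replication factor of 3, ONE tolerates two replicas being down, QUORUM tolerates one, and ALL tolerates none.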

Some resources for managing node failure:

  1. (Check your C* version first) DataStax Failure detection and recovery
  2. C* High Availability from Planet Cassandra
  3. Configuring Consistency Level

Upvotes: 3
