np-hard

Reputation: 5815

driver default retry policy

I am testing our Cassandra cluster for resiliency. It's a 9-node cluster with RF=3. When I disable all traffic on port 7000 of one node, the client gets a

com.datastax.driver.core.exceptions.ReadTimeoutException: Cassandra timeout during read query at consistency QUORUM (2 responses were required but only 1 replica responded)

The reason is that this host was only partially able to communicate with the other nodes, but the driver then proceeds to retry on the same host:

INFO  - c.d.d.c.p.LoggingRetryPolicy -  - Retrying on read timeout on same host at consistency QUORUM (initial consistency: QUORUM, required responses: 2, received responses: 1, data retrieved: true, retries: 0)

It performs ALL its retries on the same host and never recovers; eventually the request fails.

I can create a custom policy, but I am wondering why it never tries any other nodes?

Upvotes: 0

Views: 728

Answers (1)

Carlos Monroy Nieblas

Reputation: 2283

As per your setup, the database keeps only 3 copies of each piece of data (RF=3). So, even though you have 9 nodes, QUORUM is evaluated only against the 3 nodes that actually own the data, which is determined by the tokens and their assignment to the nodes.

Before disabling the port on that node, was the cluster reported as healthy? (In other words, did nodetool status report all the nodes as UN, Up and Normal?) Is the latency reported by all the nodes similar? If a node has increased latency, the query will time out before it gets a response from it.

Before creating a custom policy, and once you have confirmed that all the nodes are healthy, reachable, and available, you may want to explore using a lower consistency level such as ONE (ANY is available only for writes), which can improve resiliency and performance at the cost of consistency, or increasing the replication factor, which increases the number of nodes holding the data at the cost of higher disk utilization.
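To make the trade-off concrete, here is a minimal sketch of the arithmetic, not driver code. The class and method names (ReplicaMath, quorum, requiredResponses, tolerableReplicaFailures) are made up for illustration. It assumes a single datacenter and the usual quorum definition of floor(RF / 2) + 1, which matches the "2 responses were required" in your error message for RF=3.

```java
public class ReplicaMath {
    // Quorum size for a given replication factor: floor(rf / 2) + 1.
    public static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    // Number of replicas that must answer a read at the given consistency level.
    // Only ONE, QUORUM and ALL are modelled here (illustrative subset).
    public static int requiredResponses(String consistencyLevel, int replicationFactor) {
        switch (consistencyLevel) {
            case "ONE":
                return 1;
            case "QUORUM":
                return quorum(replicationFactor);
            case "ALL":
                return replicationFactor;
            default:
                throw new IllegalArgumentException("Unsupported level: " + consistencyLevel);
        }
    }

    // How many of the replicas owning the data can be unreachable
    // while a read at this consistency level can still succeed.
    public static int tolerableReplicaFailures(String consistencyLevel, int replicationFactor) {
        return replicationFactor - requiredResponses(consistencyLevel, replicationFactor);
    }

    public static void main(String[] args) {
        // The total node count (9 in the question) does not appear anywhere:
        // only the replication factor matters for a given partition.
        for (int rf : new int[] {3, 5}) {
            for (String cl : new String[] {"ONE", "QUORUM", "ALL"}) {
                System.out.printf("RF=%d %s: required=%d, tolerable failures=%d%n",
                        rf, cl,
                        requiredResponses(cl, rf),
                        tolerableReplicaFailures(cl, rf));
            }
        }
    }
}
```

With RF=3, QUORUM can lose only one of the three replicas that own a partition, while ONE can lose two. Raising RF to 5 lets QUORUM lose two, at the cost of storing two more copies.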

Upvotes: 2
