ayush singhal

Reputation: 11

CASWriteUnknownException in Cassandra 4.1.7 After Changing Replication Factor to 3

We are encountering the following error in our 5-node Cassandra 4.1.7 cluster after updating the replication factor from 1 to 3:

com.datastax.oss.driver.api.core.servererrors.CASWriteUnknownException: CAS operation result is unknown - proposal was not accepted by a quorum. (1 / 2)

Request:

We are seeking insights into why this CASWriteUnknownException is occurring after increasing the replication factor. Specifically, is the issue related to quorum consistency, or is there some other configuration problem in the cluster?
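For context: a conditional (IF ...) write goes through Cassandra's Paxos/LWT path, and with RF=3 the serial quorum is 2 replicas, so the "(1 / 2)" in the message means only 1 of the 2 required replicas acknowledged the proposal. Below is a minimal sketch of the kind of statement that exercises this path, and of catching the exception, assuming the DataStax Java driver 4.x and a hypothetical ks.users table:

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.ResultSet;
import com.datastax.oss.driver.api.core.cql.SimpleStatement;
import com.datastax.oss.driver.api.core.servererrors.CASWriteUnknownException;

public class LwtExample {
    public static void main(String[] args) {
        // connects with the driver's default contact points / configuration
        try (CqlSession session = CqlSession.builder().build()) {
            // a conditional (LWT) insert; ks.users is a hypothetical table
            SimpleStatement insert = SimpleStatement.newInstance(
                "INSERT INTO ks.users (id, name) VALUES (?, ?) IF NOT EXISTS",
                1, "alice");
            try {
                ResultSet rs = session.execute(insert);
                System.out.println("applied: " + rs.wasApplied());
            } catch (CASWriteUnknownException e) {
                // acceptors that responded vs. the serial quorum size required
                System.out.printf("CAS outcome unknown: %d / %d acceptors%n",
                    e.getReceived(), e.getBlockFor());
            }
        }
    }
}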

Upvotes: 1

Views: 84

Answers (2)

Erick Ramirez

Reputation: 16353

Off the top of my head, the only scenario I could think of where a compare-and-set (CAS) proposal would result in a CASWriteUnknownException is when the client doesn't know (a) if the proposal reached the accepting nodes, or (b) if the proposal was accepted.

In this scenario, the status is unknown because the client didn't get a response from the nodes about the proposal, which can happen as a result of a network interruption or partition.

It would have been handy to have (1) the full error message (not just the exception) plus (2) the full exception stack trace, since they might provide clues about the cause of the failure. In the absence of those, I'm inclined to think there is an underlying network issue between the clients and the cluster nodes. Cheers!
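Since the outcome is unknown rather than failed, one pattern applications commonly use to disambiguate (a sketch under that assumption, not something prescribed by the driver) is to read the row back at SERIAL consistency, which forces any in-flight Paxos round to complete before the read returns. The hypothetical ks.users table from the earlier sketch is reused here:

import com.datastax.oss.driver.api.core.ConsistencyLevel;
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.Row;
import com.datastax.oss.driver.api.core.cql.SimpleStatement;

class CasOutcomeCheck {
    // call after catching CASWriteUnknownException: a SERIAL read completes
    // any in-flight Paxos round, so the returned row reflects the final state
    static boolean writeApplied(CqlSession session) {
        SimpleStatement read = SimpleStatement.newInstance(
                "SELECT name FROM ks.users WHERE id = ?", 1)
            .setConsistencyLevel(ConsistencyLevel.SERIAL);
        Row row = session.execute(read).one();
        return row != null && "alice".equals(row.getString("name"));
    }
}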

Upvotes: 0

Aaron

Reputation: 57798

While it was stated that nodetool repair was run, the symptoms described seem to indicate that it was not. I'd recommend re-running a full repair (not incremental) on all nodes, as shown below.
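For reference, plain nodetool repair has defaulted to incremental mode since Cassandra 2.2, so a full repair has to be requested explicitly; the keyspace name here is assumed:

nodetool repair --full my_keyspace

Run it on each node in turn (or add -pr on every node so each node repairs only its primary ranges).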

I suppose that increasing the RF can increase the load on the cluster, so it's possible that there is some compute resource contention. Perhaps have a look at metrics around (successful) read latencies, and see if increasing node resources helps.
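One quick way to check those latencies from the command line (a suggestion, not part of the original answer) is the coordinator-level histograms, which in Cassandra 4.x report CAS read and CAS write latencies as separate columns:

nodetool proxyhistograms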

But again, to me, this sounds like nodetool repair was either not run, not successful, or run in "incremental" mode.

Edit 20241210

After doing some additional digging, I'm going to go with resource contention as the prime suspect here. This exception is thrown during the proposal phase of an LWT write operation, so my thought is that the remaining two replicas are too overloaded to "accept" the transaction proposal before it times out.
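If overload is indeed the cause, the relevant knobs live in cassandra.yaml; a sketch showing the 4.1 defaults (raising these only buys the replicas more time, it doesn't remove the contention):

# cassandra.yaml (4.1 duration syntax; these are the shipped defaults)
write_request_timeout: 2000ms
cas_contention_timeout: 1000ms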

Upvotes: 2
