Reputation: 11
We are encountering the following error in our 5-node Cassandra 4.1.7 cluster after updating the replication factor from 1 to 3:
com.datastax.oss.driver.api.core.servererrors.CASWriteUnknownException: CAS operation result is unknown - proposal was not accepted by a quorum. (1 / 2)
Cassandra Version: 4.1.7
Cluster Size: 5 nodes
Replication Strategy: NetworkTopologyStrategy
Replication Factor: Initially 1, changed to 3 for the test keyspace.
Consistency Level: LOCAL_QUORUM for both read and write operations.
The issue persists after the replication factor was changed to 3.
All nodes show as up in nodetool status, and there are no node failures.
We ran nodetool repair on all nodes.
The error occurs even with minimal load (just a single request).
We are seeking insights into why this CASWriteUnknownException occurs after increasing the replication factor. Specifically, we are curious whether the issue is related to quorum consistency or to some other configuration problem in the cluster.
Upvotes: 1
Views: 84
Reputation: 16353
Off the top of my head, the only scenario I can think of where a compare-and-set (CAS) proposal would result in a CASWriteUnknownException is when the client doesn't know (a) whether the proposal reached the accepting nodes, or (b) whether the proposal was accepted.
In this scenario, the status is unknown because the client didn't get a response from the nodes about the proposal, which can happen as a result of a network interruption or partition.
It would have been handy to have (1) the full error message (not just the exception) and (2) the full exception stack trace, since they might provide clues about the cause of the failure. In the absence of those, I'm inclined to think there is an underlying network issue between the clients and the cluster nodes. Cheers!
Upvotes: 0
Reputation: 57798
While it was stated that nodetool repair was run, the symptoms described seem to indicate that it was not. I'd recommend re-running a full repair (not incremental) on all nodes.
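For reference, a full (non-incremental) repair can be invoked like this; the keyspace name here is a placeholder for your test keyspace:

```shell
# Full repair (not incremental) on the affected keyspace.
# Run on every node, or add -pr on each node so each node repairs
# only its primary token ranges and ranges aren't repaired twice.
nodetool repair -full my_keyspace
```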
Increasing the RF also increases load on the cluster, so there may be some compute-resource contention. Have a look at metrics around (successful) read latencies, and see if increasing node resources helps.
But again, to me this sounds like nodetool repair was either not run, not successful, or run in incremental mode.
Edit 20241210
After doing some additional digging, I'm going to go with resource contention as the prime suspect here. This exception is thrown during the proposal phase of a lightweight transaction (LWT) write, so my thought is that the remaining two replicas are too overloaded to accept the transaction proposal before it times out.
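A quick way to read the "(1 / 2)" in the error message: with RF=3, a proposal needs acceptance from a quorum of floor(RF/2) + 1 = 2 replicas, and the message says only 1 of the required 2 accepted. A minimal sketch of that arithmetic (class and method names are illustrative, not from the driver):

```java
public class QuorumMath {
    // Quorum for a given replication factor: floor(rf / 2) + 1
    static int quorum(int rf) {
        return rf / 2 + 1;
    }

    public static void main(String[] args) {
        System.out.println("RF=1 quorum: " + quorum(1)); // 1 -> old setup always "succeeded"
        System.out.println("RF=3 quorum: " + quorum(3)); // 2 -> the "2" in "(1 / 2)"
        System.out.println("RF=5 quorum: " + quorum(5)); // 3
    }
}
```

This also explains why the problem only surfaced after the RF change: at RF=1 a single replica was a quorum by itself, so a slow or unrepaired second replica could never block the proposal phase.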
Upvotes: 2