Reputation: 306
I created a custom retry policy to implement backoff for my Keyspaces cluster. It works for read timeouts: I do see "retry on_read_timeout" in my logs. But it doesn't retry on ReadFailure, and my understanding is that it should, thanks to the on_request_error method. Instead, it fails immediately with the following error:

Error from server: code=1300 [Replica(s) failed to execute read] message="Operation failed - received 0 responses and 1 failures" info={'consistency': 'LOCAL_QUORUM', 'required_responses': 2, 'received_responses': 0, 'failures': 1}

cassandra-driver==3.25.0
cassandra-sigv4==4.0.2

Here is my retry policy:
import time

from cassandra.policies import RetryPolicy


class KeyspacesRetryPolicy(RetryPolicy):
    def __init__(self, RETRY_MAX_ATTEMPTS=5, base_delay=0.5, max_delay=10):
        # retry_num starts at 0
        self.max_retry_num = RETRY_MAX_ATTEMPTS - 1
        self.base_delay = base_delay
        self.max_delay = max_delay

    def __backoff(self, retry_num):
        # exponential backoff delay
        delay = min(self.base_delay * (2 ** (retry_num + 1)), self.max_delay)
        print(f"Backing off for {delay} seconds (retry number {retry_num})")
        time.sleep(delay)

    def on_read_timeout(self, query, consistency, required_responses, received_responses, data_retrieved, retry_num):
        if retry_num <= self.max_retry_num:
            self.__backoff(retry_num)
            print("retry on_read_timeout")
            return self.RETRY, consistency
        else:
            return self.RETHROW, None

    def on_write_timeout(self, query, consistency, write_type, required_responses, received_responses, retry_num):
        if retry_num <= self.max_retry_num:
            self.__backoff(retry_num)
            print("retry on_write_timeout")
            return self.RETRY, consistency
        else:
            return self.RETHROW, None

    def on_unavailable(self, query, consistency, required_replicas, alive_replicas, retry_num):
        if retry_num <= self.max_retry_num:
            self.__backoff(retry_num)
            print("retry on_unavailable")
            return self.RETRY, consistency
        else:
            return self.RETHROW, None

    def on_request_error(self, query, consistency, error, retry_num):
        if retry_num <= self.max_retry_num:
            self.__backoff(retry_num)
            print("retry on_request_error")
            return self.RETRY, consistency
        else:
            return self.RETHROW, None
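For reference, a minimal sketch of how such a policy can be attached to the Cluster through an execution profile; the endpoint, port, and the omitted auth/SSL setup are illustrative placeholders, not taken from the post:

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT

profile = ExecutionProfile(
    retry_policy=KeyspacesRetryPolicy(RETRY_MAX_ATTEMPTS=5),
    consistency_level=ConsistencyLevel.LOCAL_QUORUM,
)

cluster = Cluster(
    ["cassandra.eu-west-1.amazonaws.com"],  # hypothetical Keyspaces endpoint
    port=9142,
    execution_profiles={EXEC_PROFILE_DEFAULT: profile},
    # ssl_context=... and auth_provider=SigV4AuthProvider(...) omitted for brevity
)
session = cluster.connect()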
Upvotes: 0
Views: 48
Reputation: 16353
Looking at the Python driver code, on_request_error is not triggered for a ReadFailure (from cluster.py):
elif isinstance(response, (OverloadedErrorMessage,
                           IsBootstrappingErrorMessage,
                           TruncateError, ServerError)):
    log.warning("Host %s error: %s.", host, response.summary)
    if self._metrics is not None:
        self._metrics.on_other_error()
    cl = getattr(self.message, 'consistency_level', None)
    retry = retry_policy.on_request_error(
        self.query, cl, error=response,
        retry_num=self._query_retries)
on_request_error is not a catch-all for retry policies. It is only triggered for:

- OverloadedErrorMessage (the coordinator reported itself as overloaded)
- IsBootstrappingErrorMessage (unlikely, since a node cannot be picked as a coordinator while it is bootstrapping)
- TruncateError (a TRUNCATE request encountered an error)
- ServerError (the coordinator had an internal server failure)

For what it's worth, the only scenario where I've seen a ReadFailure is when reading a partition with lots of tombstones, leading to a TombstoneOverwhelmingException on the server side.
By default, Cassandra will abort a read after it has iterated over 100K tombstones (tombstone_failure_threshold: 100000 in cassandra.yaml). This can happen with heavy-delete workloads which generate a lot of tombstones, such as processing a queue of items where each item (a row in the partition) is deleted after it has been processed. For this reason, queues and queue-like datasets are not a good fit for Cassandra (see the blog post on Queues as a Cassandra Anti-Pattern).
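As an illustration of that access pattern (the keyspace, table, and variables here are hypothetical): each consumed item leaves a tombstone behind in the same partition, and later reads of that partition have to scan past all of them:

# Hypothetical queue-like table: consuming an item deletes its row, leaving a tombstone.
session.execute(
    "DELETE FROM my_ks.task_queue WHERE queue_id = %s AND task_id = %s",
    [queue_id, task_id],
)

# A later read of the same partition must skip every accumulated tombstone before
# returning live rows; once it crosses tombstone_failure_threshold, the replicas
# abort the read and the coordinator reports a failure to the client.
pending = session.execute(
    "SELECT * FROM my_ks.task_queue WHERE queue_id = %s",
    [queue_id],
)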
It doesn't make sense for the driver to retry ReadFailure errors, since it would just iterate over the same deleted rows and give up again after reading 100K tombstones. So in this case, the retry policy is working as it should. Cheers!
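Since the driver surfaces ReadFailure directly instead of passing it to the retry policy, one option is to handle it in application code. A minimal sketch, assuming a connected session and a hypothetical query:

from cassandra import ReadFailure

try:
    rows = session.execute("SELECT * FROM my_ks.my_table WHERE id = %s", [item_id])
except ReadFailure as exc:
    # The replicas failed the read (e.g. tombstone overload); a blind retry would
    # hit the same failure, so log it and revisit the data model instead.
    print(f"Read failed on the server side: {exc}")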
Upvotes: 1