sechstein

Reputation: 306

Cassandra Python driver custom retryPolicy doesn't catch ReadFailure

I created a custom retryPolicy to implement backoff for my Keyspaces cluster.

It works for read timeouts; I do see logs with retry on_read_timeout. But it doesn't retry on ReadFailure, and my understanding is that it should, thanks to the on_request_error method. Instead it fails immediately with the following error: Error from server: code=1300 [Replica(s) failed to execute read] message="Operation failed - received 0 responses and 1 failures" info={'consistency': 'LOCAL_QUORUM', 'required_responses': 2, 'received_responses': 0, 'failures': 1}

cassandra-driver==3.25.0 cassandra-sigv4==4.0.2

Here is my retry policy:

import time

from cassandra.policies import RetryPolicy

class KeyspacesRetryPolicy(RetryPolicy):
    def __init__(self, RETRY_MAX_ATTEMPTS=5, base_delay=0.5, max_delay=10):
        # retry_num starts at 0, so the last allowed retry is RETRY_MAX_ATTEMPTS - 1
        self.max_retry_num = RETRY_MAX_ATTEMPTS - 1
        self.base_delay = base_delay
        self.max_delay = max_delay

    def __backoff(self, retry_num):
        # exponential backoff delay
        delay = min(self.base_delay * (2 ** (retry_num + 1)), self.max_delay)
        print(f"Backing off for {delay} seconds (retry number {retry_num})")
        time.sleep(delay)

    def on_read_timeout(self, query, consistency, required_responses, received_responses, data_retrieved, retry_num):
        if retry_num <= self.max_retry_num:
            self.__backoff(retry_num)
            print("retry on_read_timeout")
            return self.RETRY, consistency
        else:
            return self.RETHROW, None 

    def on_write_timeout(self, query, consistency, write_type, required_responses, received_responses, retry_num):
        if retry_num <= self.max_retry_num:
            self.__backoff(retry_num)
            print("retry on_write_timeout")
            return self.RETRY, consistency
        else:
            return self.RETHROW, None

    def on_unavailable(self, query, consistency, required_replicas, alive_replicas, retry_num):
        if retry_num <= self.max_retry_num:
            self.__backoff(retry_num)
            print("retry on_unavailable")
            return self.RETRY, consistency
        else:
            return self.RETHROW, None 

    def on_request_error(self, query, consistency, error, retry_num):
        if retry_num <= self.max_retry_num:
            self.__backoff(retry_num)
            print("retry on_request_error")
            return self.RETRY, consistency
        else:
            return self.RETHROW, None
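
For reference, here is a minimal sketch of how the policy is plugged into the driver through an execution profile (the contact point and port are placeholders for a Keyspaces setup; the SigV4 auth and SSL configuration are omitted):

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT

# wire the custom retry policy into the default execution profile
profile = ExecutionProfile(
    retry_policy=KeyspacesRetryPolicy(RETRY_MAX_ATTEMPTS=5),
    consistency_level=ConsistencyLevel.LOCAL_QUORUM,
)
cluster = Cluster(
    ["cassandra.us-east-1.amazonaws.com"],  # placeholder Keyspaces endpoint
    port=9142,
    execution_profiles={EXEC_PROFILE_DEFAULT: profile},
)
session = cluster.connect()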

Upvotes: 0

Views: 48

Answers (1)

Erick Ramirez

Reputation: 16353

Looking at the Python driver code, on_request_error is not triggered for a ReadFailure (from cluster.py):

                elif isinstance(response, (OverloadedErrorMessage,
                                           IsBootstrappingErrorMessage,
                                           TruncateError, ServerError)):
                    log.warning("Host %s error: %s.", host, response.summary)
                    if self._metrics is not None:
                        self._metrics.on_other_error()
                    cl = getattr(self.message, 'consistency_level', None)
                    retry = retry_policy.on_request_error(
                        self.query, cl, error=response,
                        retry_num=self._query_retries)

on_request_error is not a catch-all for retry policies. It is only triggered for:

  • OverloadedErrorMessage (the coordinator reported itself as overloaded)
  • IsBootstrappingErrorMessage (unlikely since a node cannot be picked as a coordinator when it is bootstrapping)
  • TruncateError (a TRUNCATE request encountered an error)
  • ServerError (the coordinator had an internal server failure)
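
Since a ReadFailure never reaches the retry policy, it propagates to the caller, where it can be caught and inspected. A minimal sketch, assuming the driver's ReadFailure exception class (the query and table names are made up for illustration):

from cassandra import ReadFailure

try:
    rows = session.execute("SELECT * FROM my_keyspace.my_table WHERE pk = %s", (1,))
except ReadFailure as exc:
    # the exception mirrors the info dict from the server error above
    print(exc.consistency, exc.required_responses, exc.received_responses, exc.failures)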

For what it's worth, the only scenario where I've seen a ReadFailure is when reading a partition with lots of tombstones, leading to a TombstoneOverwhelmingException on the server side.

By default, Cassandra will abort a read after it has iterated over 100K tombstones (tombstone_failure_threshold: 100000 in cassandra.yaml). This can happen with heavy-delete workloads that generate a lot of tombstones, such as processing a queue of items where each item (a row in the partition) is deleted after it has been processed. For this reason, queues and queue-like datasets are not a good fit for Cassandra (see the blog post on Queues as a Cassandra Anti-Pattern).

It doesn't make sense for the driver to retry ReadFailure errors: a retry would just iterate over the same deleted rows and give up again after reading 100K tombstones. So in this case, the retry policies are working as they should. Cheers!

Upvotes: 1
