AVI

Reputation: 348

Simple INSERT sporadically fails with Cassandra::Errors::TimeoutError, Cassandra::Errors::WriteTimeoutError

In production with 3 nodes and LOCAL_QUORUM consistency, inserts sporadically fail. We only get Cassandra::Errors::TimeoutError, not Cassandra::Errors::WriteTimeoutError, which I think means the driver is not able to connect to the node(s). However, I don't get Cassandra::Errors::NoHostsAvailable: All attempted hosts failed either.

I looked at the Cassandra logs and there is nothing there; the application logs show the error.

It happens about 1k times per day, and a retry from the caller side usually succeeds.

My guess is that the driver is having some issue.

    ruby '~> 2.7'
    gem "cassandra-driver", "~> 3.2.5"
    consistency:           :local_quorum,

    load_balancing_policies = {
        dc_aware_round_robin: Cassandra::LoadBalancing::Policies::DCAwareRoundRobin.new(
            datacenter,
            cassandra_used_hosts_per_remote_dc
        ),
        round_robin: Cassandra::LoadBalancing::Policies::RoundRobin.new
    }
    CASSANDRA_CONNECT_TIMEOUT_MS: '600'
    CASSANDRA_CONSISTENCY: LOCAL_QUORUM
    CASSANDRA_RECONNECT_INITIAL_INTERVAL_MS: '100'
    CASSANDRA_RECONNECT_MAX_INTERVAL_MS: '3000'
    CASSANDRA_RECONNECT_MAX_RETRIES: '5'
    CASSANDRA_RETRIES: '5'
    CASSANDRA_RETRY_MAX_MS: '3000'
    CASSANDRA_RETRY_MIN_MS: '100'
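
For reference, a caller-side retry using the retry settings above (5 attempts, delays between 100 ms and 3000 ms) might look roughly like the sketch below. It is only an illustration: exponential backoff is an assumption (how the delays are actually computed is not shown here), and `with_retries` and `FakeTimeout` are hypothetical names, not part of cassandra-driver.

```ruby
# Hypothetical helper (not part of cassandra-driver): re-run the block when one
# of the listed error classes is raised, sleeping between attempts.
def with_retries(retry_on:, attempts: 5, min_delay: 0.1, max_delay: 3.0)
  tries = 0
  begin
    tries += 1
    yield tries
  rescue *retry_on
    raise if tries >= attempts            # out of attempts: re-raise the last error
    # Assumed backoff: double the delay each time, capped at max_delay.
    sleep([min_delay * (2**(tries - 1)), max_delay].min)
    retry
  end
end

# Stand-in for Cassandra::Errors::TimeoutError so the sketch runs without a cluster.
class FakeTimeout < StandardError; end

calls = 0
result = with_retries(retry_on: [FakeTimeout], min_delay: 0.001, max_delay: 0.01) do |attempt|
  calls += 1
  raise FakeTimeout, "Timed out" if attempt < 3   # fail twice, then succeed
  :ok
end

puts "succeeded after #{calls} attempts: #{result}"
```

With the real driver the block would wrap `session.execute(...)` and `retry_on` would list `Cassandra::Errors::TimeoutError`. Retrying like this is only safe when the statement is idempotent, because a request that timed out on the client may still have been applied.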

So I looked at lib/cassandra/future.rb:

    # Returns future value or raises future error
    #
    # @note This method blocks until a future is resolved or a times out
    #
    # @param timeout [nil, Numeric] a maximum number of seconds to block
    #   current thread for while waiting for this future to resolve. Will
    #   wait indefinitely if passed `nil`.
    #
    # @raise [Errors::TimeoutError] raised when wait time exceeds the timeout
    # @raise [Exception] raises when the future has been resolved with an
    #   error. The original exception will be raised.
    #
    # @return [Object] the value that the future has been resolved with
    def get(timeout = nil)
      @signal.get(timeout)
    end

The error and stack trace:

    Cassandra::Errors::TimeoutError
    Timed out

    Crashed in non-app: cassandra/future.rb in get
    cassandra/future.rb in get at line 402
    cassandra/session.rb in execute at line 127

    /srv/_versions/events/events-202304261636-9ba0b992cd-master/vendor/bundle/ruby/2.7.0/gems/cassandra-driver-3.2.5/lib/cassandra/future.rb:637:in 'get',
    /srv/_versions/events/events-202304261636-9ba0b992cd-master/vendor/bundle/ruby/2.7.0/gems/cassandra-driver-3.2.5/lib/cassandra/future.rb:402:in 'get',
    /srv/_versions/events/events-202304261636-9ba0b992cd-master/vendor/bundle/ruby/2.7.0/gems/cassandra-driver-3.2.5/lib/cassandra/session.rb:127:in 'execute'

Upvotes: 0

Views: 92

Answers (3)

AVI

Reputation: 348

So I figured out the issue and just realized I never answered the question. The reason was large partition size; the Cassandra logs were bleeding with messages like

WARN  [CompactionExecutor:170358] BigTableWriter.java:258 - Writing large partition xxx/yyy:1716208:2023-09-25-16-10 (103.262MiB) to sstable /data/cassandra/data/xxx/yyy-a88t665njhgs833sbjjkdl/nb-4343435-big-Data.db

Whenever these > 100 MB partitions get flushed from the memtable or compacted, latency increases drastically.

Solution:

It was rather simple. Our partition key was some other column + a bucket (yyyy-mm-dd-hh-mm extracted from our clustering column of type timeuuid), and we were chopping the last digit off the minute, so essentially anything within a 10-minute window went to a single partition. I changed it to 1 minute. That stopped the bleeding while we redesign the table.
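
To make the bucketing change concrete, here is a minimal sketch. `bucket_for` is a hypothetical helper, not our actual code, and it takes a plain `Time` instead of extracting the timestamp from the timeuuid. I'm assuming the old bucket was the yyyy-mm-dd-hh-mm string with its last character removed.

```ruby
# Hypothetical helper: build the bucket part of the partition key from a timestamp.
# chop_minute_digit: true  -> old behaviour, one bucket per 10-minute window
# chop_minute_digit: false -> new behaviour, one bucket per minute
def bucket_for(time, chop_minute_digit:)
  stamp = time.utc.strftime("%Y-%m-%d-%H-%M")
  chop_minute_digit ? stamp[0...-1] : stamp
end

t1 = Time.utc(2023, 9, 25, 16, 10, 5)
t2 = Time.utc(2023, 9, 25, 16, 19, 59)

# Old scheme: both writes share one bucket, so they land in the same partition.
old1 = bucket_for(t1, chop_minute_digit: true)
old2 = bucket_for(t2, chop_minute_digit: true)

# New scheme: each minute gets its own bucket, so partitions stay much smaller.
new1 = bucket_for(t1, chop_minute_digit: false)
new2 = bucket_for(t2, chop_minute_digit: false)

puts "old: #{old1} / #{old2}"
puts "new: #{new1} / #{new2}"
```

Narrower buckets mean roughly ten times more partitions of about a tenth the size, at the cost of reading more partitions when querying a time range.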

Upvotes: 0

Erick Ramirez

Reputation: 16353

The errors you mentioned are all different from each other and are mutually exclusive.

A TimeoutError is a client-side error that is raised by the driver when it has not heard back from the coordinator (server-side) within :timeout seconds. This means that the driver gave up waiting for the coordinator to respond.
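
To illustrate why this is purely a client-side condition, here is a toy future. It is a sketch only and not the driver's implementation; `ToyFuture` and `ClientTimeout` are hypothetical names standing in for the driver's future and `Cassandra::Errors::TimeoutError`. The waiting thread gives up after its timeout while the slow "coordinator" thread still finishes the work afterwards.

```ruby
# Stand-in for Cassandra::Errors::TimeoutError (hypothetical, for illustration).
class ClientTimeout < StandardError; end

# Toy future: get(timeout) blocks until resolved or until the wait time runs out.
class ToyFuture
  def initialize
    @lock = Mutex.new
    @cond = ConditionVariable.new
    @resolved = false
    @value = nil
  end

  def resolve(value)
    @lock.synchronize do
      @value = value
      @resolved = true
      @cond.broadcast
    end
  end

  def get(timeout = nil)
    now = -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) }
    deadline = timeout && now.call + timeout
    @lock.synchronize do
      until @resolved
        remaining = deadline && deadline - now.call
        # Only the waiting side gives up here; nothing is cancelled elsewhere.
        raise ClientTimeout, "Timed out" if remaining && remaining <= 0
        @cond.wait(@lock, remaining)
      end
      @value
    end
  end
end

future = ToyFuture.new
server = Thread.new { sleep 0.2; future.resolve(:applied) }  # slow "coordinator"

begin
  future.get(0.05)             # client waits 50 ms, response takes 200 ms
rescue ClientTimeout => e
  first = e.message
end

server.join
second = future.get(0.05)      # the work completed after the client gave up

puts "first wait: #{first}, later: #{second}"
```

The practical consequence is that a TimeoutError does not tell you whether the write was applied; the request may still succeed on the server after the driver has stopped waiting.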

A WriteTimeoutError is a server-side error returned by the coordinator when replicas have not acknowledged a write request within write_request_timeout_in_ms (in cassandra.yaml), usually because the commitlog disk is not able to keep up.

A NoHostsAvailable error is raised by the driver when all the hosts it attempted to contact are unavailable or unresponsive. In this situation, the driver was not able to contact any node at all, so a coordinator never got picked to coordinate the request, which is completely different from the first two errors above.

If you are intermittently seeing TimeoutError and WriteTimeoutError for INSERT statements, then there's a good chance that the errors are getting raised during peak application traffic, which indicates that your cluster cannot cope with the load.

It would be a good time to review the capacity of your cluster and either throttle the app traffic or consider increasing the capacity of your cluster by adding more nodes. Cheers!

Upvotes: 2

Madhavan

Reputation: 649

Based on your description,

"It's happening like 1k per day, and usually retries from caller side results in success"

the Cassandra cluster is not correctly sized for the workload that you're putting on it.

You may have to size the cluster appropriately or scale it up according to the load on it. That is too broad a topic to cover here. Alternatively, you could simply pick a serverless SaaS offering and not have to worry about scaling, as it would do it for you automatically.

A couple of links to help you with that:

Upvotes: 3
