Reputation: 348
In production, with 3 nodes and LOCAL_QUORUM consistency, inserts fail sporadically and we get only Cassandra::Errors::TimeoutError, not Cassandra::Errors::WriteTimeoutError. I think that means the driver is not able to connect to the node(s), but I don't get Cassandra::Errors::NoHostsAvailable: All attempted hosts failed either.
I looked at the Cassandra logs and there is nothing there; only the application logs show the error.
It happens about 1k times per day, and a retry from the caller side usually succeeds.
My guess is that the driver is having some issue.
ruby '~> 2.7'
gem "cassandra-driver", "~> 3.2.5"
consistency: :local_quorum,
load_balancing_policies = {
  dc_aware_round_robin: Cassandra::LoadBalancing::Policies::DCAwareRoundRobin.new(
    datacenter,
    cassandra_used_hosts_per_remote_dc
  ),
  round_robin: Cassandra::LoadBalancing::Policies::RoundRobin.new
}
CASSANDRA_CONNECT_TIMEOUT_MS: '600'
CASSANDRA_CONSISTENCY: LOCAL_QUORUM
CASSANDRA_RECONNECT_INITIAL_INTERVAL_MS: '100'
CASSANDRA_RECONNECT_MAX_INTERVAL_MS: '3000'
CASSANDRA_RECONNECT_MAX_RETRIES: '5'
CASSANDRA_RETRIES: '5'
CASSANDRA_RETRY_MAX_MS: '3000'
CASSANDRA_RETRY_MIN_MS: '100'
So I looked at lib/cassandra/future.rb:
# Returns future value or raises future error
#
# @note This method blocks until a future is resolved or a times out
#
# @param timeout [nil, Numeric] a maximum number of seconds to block
# current thread for while waiting for this future to resolve. Will
# wait indefinitely if passed `nil`.
#
# @raise [Errors::TimeoutError] raised when wait time exceeds the timeout
# @raise [Exception] raises when the future has been resolved with an
# error. The original exception will be raised.
#
# @return [Object] the value that the future has been resolved with
def get(timeout = nil)
  @signal.get(timeout)
end
Cassandra::Errors::TimeoutError
Timed out
Crashed in non-app: cassandra/future.rb in get
cassandra/future.rb in get at line 402
cassandra/session.rb in execute at line 127
/srv/_versions/events/events-202304261636-9ba0b992cd-master/vendor/bundle/ruby/2.7.0/gems/cassandra-driver-3.2.5/lib/cassandra/future.rb:637:in 'get',
/srv/_versions/events/events-202304261636-9ba0b992cd-master/vendor/bundle/ruby/2.7.0/gems/cassandra-driver-3.2.5/lib/cassandra/future.rb:402:in 'get',
/srv/_versions/events/events-202304261636-9ba0b992cd-master/vendor/bundle/ruby/2.7.0/gems/cassandra-driver-3.2.5/lib/cassandra/session.rb:127:in 'execute'
Upvotes: 0
Views: 92
Reputation: 348
So I figured out the issue and just realized I never answered the question. The cause was large partitions; the Cassandra logs were flooded with messages like
WARN [CompactionExecutor:170358] BigTableWriter.java:258 - Writing large partition xxx/yyy:1716208:2023-09-25-16-10 (103.262MiB) to sstable /data/cassandra/data/xxx/yyy-a88t665njhgs833sbjjkdl/nb-4343435-big-Data.db
Whenever a memtable flush or compaction writes one of these > 100 MB partitions to disk, latency increases drastically.
Solution:
It was rather simple. Our partition key was some other column + a bucket (yyyy-mm-dd-hh-mm extracted from our clustering column of type timeuuid), and we were chopping the last digit off the minute, so essentially anything within a 10-minute window went to a single partition. I changed it to 1 minute. That stopped the bleeding while we redesign the table.
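The bucketing change described above can be sketched roughly as follows. The helper names are hypothetical and the real bucketing code is not shown in this post, so treat this as an illustration of the idea (10-minute versus 1-minute buckets derived from a timestamp), not the actual implementation:

```ruby
# Hypothetical helpers illustrating the two bucket granularities.
# Assumption: the bucket string is built from the event time in UTC.

# Old behaviour: drop the last digit of the minute, so every event in the
# same 10-minute window shares one bucket (and therefore one partition).
def ten_minute_bucket(time)
  time.getutc.strftime("%Y-%m-%d-%H-%M")[0...-1]
end

# New behaviour: keep the full minute, so each partition covers 1 minute.
def one_minute_bucket(time)
  time.getutc.strftime("%Y-%m-%d-%H-%M")
end

a = Time.utc(2023, 9, 25, 16, 11, 5)
b = Time.utc(2023, 9, 25, 16, 17, 42)

puts ten_minute_bucket(a) == ten_minute_bucket(b) # same partition before the change
puts one_minute_bucket(a) == one_minute_bucket(b) # different partitions after it
```

With 1-minute buckets the same write volume is spread over roughly ten times as many partitions, which is why it shrinks partition size without changing the rest of the key.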
Upvotes: 0
Reputation: 16353
The errors you mentioned are all different from each other and are mutually exclusive.
A TimeoutError is a client-side error that is raised by the driver when it has not heard back from the coordinator (server-side) within :timeout seconds. This means that the driver gave up waiting for the coordinator to respond.
A WriteTimeoutError is a server-side error returned by the coordinator when not enough replicas have acknowledged a write request to satisfy the consistency level within write_request_timeout_in_ms (in cassandra.yaml), usually because the commitlog/disk is not able to keep up.
A NoHostsAvailable error is raised by the driver when all the hosts it attempted to contact are unavailable or unresponsive. In this situation, the driver was not able to contact any node at all, so a coordinator never got picked to coordinate the request, which is completely different from the first two errors above.
If you are intermittently seeing TimeoutError and WriteTimeoutError for INSERT statements, then there's a good chance that the errors are being raised during peak application traffic, which indicates that your cluster cannot cope with the load.
It would be a good time to review the capacity of your cluster and either throttle the app traffic or consider increasing the capacity of your cluster by adding more nodes. Cheers!
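Since the question mentions that caller-side retries usually succeed, here is a minimal sketch of retrying only on a chosen set of error classes with capped exponential backoff. Everything in it is an assumption for illustration: `with_retries` and `FakeClientTimeout` are hypothetical names (the latter stands in for Cassandra::Errors::TimeoutError so the snippet runs without the gem), and the defaults only loosely mirror the retry values in the question. Retrying a write after a client-side timeout is only safe if the statement is idempotent, because the original write may still have been applied.

```ruby
# Stand-in for Cassandra::Errors::TimeoutError (assumption, for a gem-free demo).
class FakeClientTimeout < StandardError; end

# Runs the block, retrying on the listed error classes with capped
# exponential backoff. `sleeper` is injectable so the wait can be stubbed.
def with_retries(retry_on:, attempts: 5, min_delay: 0.1, max_delay: 3.0,
                 sleeper: ->(seconds) { sleep(seconds) })
  tries = 0
  begin
    tries += 1
    yield tries
  rescue *retry_on
    raise if tries >= attempts # out of attempts: re-raise the last error
    sleeper.call([min_delay * (2**(tries - 1)), max_delay].min)
    retry
  end
end

# Usage: the block fails twice with a client-side timeout, then succeeds.
result = with_retries(retry_on: [FakeClientTimeout], sleeper: ->(_) {}) do |attempt|
  raise FakeClientTimeout, "Timed out" if attempt < 3
  :inserted
end
puts result
```

Retries like this only mask the symptom; if the timeouts cluster around peak traffic, the capacity review above is still the actual fix.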
Upvotes: 2
Reputation: 649
Based on your description,
It's happening like 1k per day, and usually retries from caller side results in success
the Cassandra cluster is not sized correctly for the workload that you're putting on it.
You may have to size the cluster appropriately or scale it up to match the load. That is too broad a topic to cover here. Alternatively, you could pick a serverless SaaS offering and not have to worry about scaling, as it would do that for you automatically.
Upvotes: 3