AV94

Reputation: 1814

How to resolve write timeout exception in cassandra?

I'm trying to insert 50000 records into a five-node Cassandra cluster. I'm using executeAsync to increase performance (i.e. reduce insertion time on the application side). I tried BatchStatement with several batch sizes, but every time I got the following exception.

Exception in thread "main" com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra timeout during write query at consistency ONE (1 replica were required but only 0 acknowledged the write)
at com.datastax.driver.core.exceptions.WriteTimeoutException.copy(WriteTimeoutException.java:54)
at com.datastax.driver.core.DefaultResultSetFuture.extractCauseFromExecutionException(DefaultResultSetFuture.java:259)
at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:175)
at 

I inserted data without any issue for 10000, 20000, and up to 40000 records. The following is the Java code I wrote.

for (batchNumber = 1; batchNumber <= batches; batchNumber++) {
    BatchStatement batch = new BatchStatement();
    for (record = 1; record <= batchSize; record++) {
        batch.add(ps.bind(query));
    }
    futures.add(session.executeAsync(batch));           
}
for (ResultSetFuture future : futures) {
    resultSet = future.getUninterruptibly();
}

where ps is the prepared statement, batches is the number of batches, and batchSize is the number of records in a batch.

I'm unable to understand the root cause of the issue. I thought some of the nodes might be down, but when I checked, they were all running normally.

How should I debug the exception?

Upvotes: 1

Views: 4801

Answers (2)

xmas79

Reputation: 5180

I see a few mistakes:

  1. It seems you are trying to figure out the largest number of queries you can batch together.
  2. It seems you are thinking that batching multiple statements will give you some sort of performance gain.
  3. You are mistakenly reusing the same prepared statement in the loop.
  4. You are not throttling your application at some ingestion rate.
  5. You are not performing any exception handling, e.g. retrying when a batch fails.

Let's restart.


  1. The maximum number of statements in a batch should be less than 10, and the fewer the better. Also, the total size of the batch must stay below the thresholds set in cassandra.yaml: by default a batch larger than 5kb produces a warning in the logs (batch_size_warn_threshold_in_kb), and a batch larger than 50kb fails outright (batch_size_fail_threshold_in_kb). You can tune these values, but you should keep in mind that a BATCH overloads the coordinator node. The larger the batch (both in terms of kb and number of statements), the greater the overload on the coordinator.
  2. You won't gain anything from batching unrelated statements together. Instead, you'll actually lose performance. This is due to how BATCH works: one node is chosen to coordinate all the statements, and that node is responsible for all of them. The coordinator is usually chosen based on the first statement, so if your statements hit multiple nodes, your coordinator also has to coordinate writes that belong to other nodes. If you fired multiple separate async queries instead, every node would be responsible for its own statements only, and you'd spread the load over all your cluster nodes instead of hammering one node.
  3. You are using prepared statements the wrong way. You should create a new bound statement for each insert, i.e. new BoundStatement(ps).bind(xxxx). That's an easy fix anyway.
  4. If you have a large number of queries to run, you are currently firing them all at once. Your application will exhaust its memory, because it keeps adding futures to the list, and it will eventually be killed by an OOM error. Moreover, you're not giving your cluster a chance to actually ingest the data you're firing at it, because you can produce data far faster than your cluster can absorb it. What you need to do is limit the number of futures in the list: keep it below some maximum (e.g. 1000). To do that, move your final loop with .getUninterruptibly inside the outer loop. This throttles the ingestion rate, and you will see the number of timeout exceptions decrease. Depending on the application, fewer timeout exceptions means fewer retries, hence fewer queries, less overhead, better response times, etc.
  5. It is fine to loop over the futures list with .getUninterruptibly, but keep in mind that when your cluster is overloaded, you will get timeouts. At that point you should catch the exception and deal with it, be it a retry, a re-throw, or whatever else fits your application. I suggest you design your model around idempotent queries, so you can retry failed queries until they succeed without worrying about the consequences of the retry (retries can happen at driver level too!). A sketch combining points 3 to 5 follows this list.
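
Here is a rough sketch that puts points 3 to 5 together, assuming the 3.x DataStax Java driver. The contact point, keyspace, table and column names (demo.users) and the in-flight cap of 1000 are placeholders you'd adapt to your own schema and cluster:

import com.datastax.driver.core.*;
import com.datastax.driver.core.exceptions.WriteTimeoutException;

import java.util.ArrayList;
import java.util.List;

public class ThrottledInsert {

    // Arbitrary cap on pending writes; tune it against your cluster.
    private static final int MAX_IN_FLIGHT = 1000;

    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();

        PreparedStatement ps =
                session.prepare("INSERT INTO demo.users (id, name) VALUES (?, ?)");

        List<ResultSetFuture> futures = new ArrayList<>();
        int totalRecords = 50000;

        for (int i = 0; i < totalRecords; i++) {
            // Point 3: bind a fresh BoundStatement for every row.
            BoundStatement bound = new BoundStatement(ps).bind(i, "name-" + i);
            futures.add(session.executeAsync(bound));

            // Point 4: once the window is full, wait for it to drain before
            // firing more queries instead of queueing all 50000 futures.
            if (futures.size() >= MAX_IN_FLIGHT) {
                drain(futures);
            }
        }
        drain(futures); // wait for the remaining writes
        cluster.close();
    }

    private static void drain(List<ResultSetFuture> futures) {
        for (ResultSetFuture future : futures) {
            try {
                future.getUninterruptibly();
            } catch (WriteTimeoutException e) {
                // Point 5: the cluster was overloaded for this write. With an
                // idempotent model you could re-execute the bound statement
                // (keep it alongside the future to do so); here we only log.
                System.err.println("Write timed out: " + e.getMessage());
            }
        }
        futures.clear();
    }
}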

Hope that helps.

Upvotes: 5

OrangeDog

Reputation: 38807

That's not what BATCH is for. When you add multiple statements to a batch, Cassandra will try to apply them atomically. Either all of them will succeed or none of them will, and they all have to complete within a single query timeout.

Also, if you make more requests than can be handled simultaneously, they're going to go into a queue, and time waiting in the queue contributes to the timeout.

To get them all through without timeouts, use individual statements and limit the number in flight at any one time. Alternatively, use the cqlsh COPY command to load the data from a CSV file.
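
If you go the individual-statement route, one way to cap the number in flight is a Semaphore that is released from a completion callback. This is only a sketch, assuming the 3.x Java driver and Guava 18+ on the classpath; demo.users and the limit of 256 are made-up placeholders:

import com.datastax.driver.core.*;
import com.google.common.util.concurrent.FutureCallback;
import com.google.common.util.concurrent.Futures;
import com.google.common.util.concurrent.MoreExecutors;

import java.util.concurrent.Semaphore;

public class BoundedAsyncInsert {

    public static void main(String[] args) throws InterruptedException {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();

        PreparedStatement ps =
                session.prepare("INSERT INTO demo.users (id, name) VALUES (?, ?)");

        // At most 256 outstanding requests; tune the value for your cluster.
        final Semaphore inFlight = new Semaphore(256);

        for (int i = 0; i < 50000; i++) {
            inFlight.acquire(); // blocks while 256 requests are outstanding
            ResultSetFuture future = session.executeAsync(ps.bind(i, "name-" + i));
            Futures.addCallback(future, new FutureCallback<ResultSet>() {
                @Override
                public void onSuccess(ResultSet rs) {
                    inFlight.release();
                }

                @Override
                public void onFailure(Throwable t) {
                    inFlight.release();
                    // A timeout here means the cluster is saturated: log it, or
                    // re-queue the row for a retry if the write is idempotent.
                    System.err.println("Insert failed: " + t);
                }
            }, MoreExecutors.directExecutor());
        }

        inFlight.acquire(256); // wait for the last writes to finish
        cluster.close();
    }
}

For the COPY alternative, the equivalent cqlsh one-liner would be something like COPY demo.users (id, name) FROM 'data.csv' WITH HEADER = TRUE.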

Upvotes: 1
