Sam

Reputation: 446

Compaction causes an out-of-memory error and shuts down the Cassandra process

I'm running Cassandra 3.11 in production with 18 nodes and 32 GB of memory per node. The weekly compaction job throws the following errors in system.log and debug.log, then the Cassandra process dies and I have to restart it.

ERROR [ReadStage-4] JVMStabilityInspector.java:142 - JVM state determined to be unstable.  Exiting forcefully due to: java.lang.OutOfMemoryError: Direct buffer memory

ERROR [ReadStage-5] JVMStabilityInspector.java:142 - JVM state determined to be unstable.  Exiting forcefully due to: java.lang.OutOfMemoryError: Direct buffer memory

DEBUG [ReadRepairStage:29517] ReadCallback.java:242 - Digest mismatch:
org.apache.cassandra.service.DigestMismatchException: Mismatch for key DecoratedKey(-3787568997731881233, 0000000004ca5c48) (2b912cd2000e6bb5b481fa849e438ae4 vs 962e899380ce22ac970c6be0014707de)

The Java heap size is 8 GB:

/opt/apache-cassandra-3.11.0/conf/jvm.options
-Xms8G
-Xmx8G

Is there any workaround, other than increasing the heap size, to prevent the out-of-memory issue during compaction?

Upvotes: 0

Views: 683

Answers (1)

Erick Ramirez

Reputation: 16393

On their own, the errors you posted don't indicate to me that they're related to compaction. Instead, the thread IDs (ReadStage-*) indicate that they're the result of read requests.

If anything, the DigestMismatchException is a bit more telling because it indicates that your replicas are out of sync. If you're seeing dropped mutations in the logs, that's a clear indication that the nodes are overloaded. That symptom is more in line with the OOM you're seeing, which I believe is a result of your cluster not being able to keep up with the app traffic.
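A quick way to check for dropped mutations is nodetool tpstats on each node and looking at the "Message type / Dropped" section at the bottom of the output. The figures below are only illustrative (the exact layout can vary a little between versions):

$ nodetool tpstats
...
Message type           Dropped
READ                         0
MUTATION                  1234
...

Counters that keep climbing for MUTATION are the symptom of overloaded nodes I mentioned above.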

On systems with 32 GB of RAM, we recommend bumping the heap up to 16 GB for production. I realise you'd prefer not to do this, but it's an appropriate action and it's something I would do if I were managing the cluster.
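If you do go that route, it's just the two settings you already have in jvm.options, e.g.:

/opt/apache-cassandra-3.11.0/conf/jvm.options
-Xms16G
-Xmx16G

followed by a rolling restart of the nodes so the new heap size takes effect.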

I'd also make sure that data/ and commitlog/ are on separate disks/volumes so they're not competing for the same IO (unless you have direct-attached SSDs); the relevant cassandra.yaml settings are sketched below. If you're still seeing lots of dropped mutations, consider increasing the capacity of your cluster by adding more nodes. Cheers!
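For reference, the directory locations are controlled by these cassandra.yaml settings; the paths below are just an example of pointing them at separate volumes:

data_file_directories:
    - /mnt/data/cassandra/data
commitlog_directory: /mnt/commitlog/cassandra/commitlog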

Upvotes: 2
