Scudeler

Reputation: 91

Cassandra and G1 Garbage collector stop the world event (STW)

We have a 6-node Cassandra cluster under heavy utilization. We have been dealing a lot with garbage collector stop-the-world (STW) events, which can take up to 50 seconds on our nodes; in the meantime the Cassandra node is unresponsive, not even accepting new logins.

Extra details:

Any help would be very much appreciated!



Edit 1:

Checking the object-creation stats, they do not look healthy at all.



Edit 2:

I have tried the settings suggested by Chris Lohfink; here are the GC reports:

Using CMS suggested settings http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTcvMTAvOC8tLWdjLmxvZy4wLmN1cnJlbnQtLTE5LTAtNDk=

Using G1 suggested settings http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTcvMTAvOC8tLWdjLmxvZy4wLmN1cnJlbnQtLTE5LTExLTE3

The behavior remains basically the same:

  1. Old Gen starts to fill up.
  2. GC can't clean it properly without a full GC and a STW event.
  3. The full GC starts to take longer, until the node is completely unresponsive.
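To confirm the pattern above from the raw logs, a quick sketch for pulling full-GC pause durations out of a gc.log. The sample log lines below are made up for illustration, and the line format is an assumption based on typical `-XX:+PrintGCDetails` output on Java 8; adjust the pattern to match your JVM's log format.

```shell
# Hypothetical sample gc.log (format assumed; replace with your real log path).
cat > /tmp/gc-sample.log <<'EOF'
2017-10-08T19:00:01.123+0000: 12.345: [GC pause (G1 Evacuation Pause) (young), 0.0456789 secs]
2017-10-08T19:05:42.987+0000: 354.210: [Full GC (Allocation Failure)  11G->9G(12G), 48.1234567 secs]
EOF

# Print the duration (in seconds) of every full GC, one per line.
grep 'Full GC' /tmp/gc-sample.log |
  sed -n 's/.*, \([0-9.]*\) secs\]/\1/p'
```

On the sample above this prints only the 48-second full-GC pause, which matches the "full GC takes longer and longer" progression described in the list.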

I'm going to get the cfstats output for maximum partition size and tombstones per read as soon as possible and edit the post again.
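For reference, a sketch of pulling those two figures out of `nodetool cfstats <keyspace>` output. The sample text below is invented for illustration (the field names match the cfstats/tablestats format in recent Cassandra versions); in practice you would pipe `nodetool cfstats` straight into the grep.

```shell
# Hypothetical cfstats output, saved to a file for illustration.
cat > /tmp/cfstats-sample.txt <<'EOF'
		Compacted partition maximum bytes: 2874382626
		Average tombstones per slice (last five minutes): 812.0
		Maximum tombstones per slice (last five minutes): 50123
EOF

# Show only the max partition size and tombstone-per-read lines.
grep -E 'Compacted partition maximum bytes|tombstones per slice' /tmp/cfstats-sample.txt
```

A compacted partition maximum in the gigabytes, or tombstone counts per slice in the tens of thousands, would both point at the data-model problems suggested in the answers.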

Upvotes: 2

Views: 3171

Answers (2)

Chris Lohfink
Chris Lohfink

Reputation: 16420

Without knowing your existing settings or possible data model problems, here's a guess at some conservative settings to try to reduce evacuation pauses caused by not having enough to-space (check the GC logs):

-Xmx12G -Xms12G -XX:+UseG1GC -XX:G1ReservePercent=25 -XX:G1RSetUpdatingPauseTimePercent=5 -XX:MaxGCPauseMillis=500 -XX:-ReduceInitialCardMarks -XX:G1HeapRegionSize=32m

Setting G1HeapRegionSize should also help reduce the pause from updating the remembered set, which becomes an issue, and cut down on humongous objects, which can become a problem depending on the data model. Make sure -Xmn is not set.
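Assuming Cassandra 3.x, these flags typically go one per line in conf/jvm.options (on 2.x, they would instead be appended to JVM_OPTS in conf/cassandra-env.sh). A fragment for the G1 settings above might look like:

```
# conf/jvm.options fragment (Cassandra 3.x) -- one JVM flag per line.
# Remove or comment out any existing -Xmn entry, per the advice above.
-Xms12G
-Xmx12G
-XX:+UseG1GC
-XX:G1ReservePercent=25
-XX:G1RSetUpdatingPauseTimePercent=5
-XX:MaxGCPauseMillis=500
-XX:-ReduceInitialCardMarks
-XX:G1HeapRegionSize=32m
```

Restart the node after editing so the new flags take effect, and verify them with `ps aux | grep cassandra`.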

For what it's worth, a 12 GB heap with C* is probably better suited to CMS; you can certainly get better throughput. Just be careful of fragmentation over time from the rather large objects that can get allocated.

-XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=55 -XX:MaxTenuringThreshold=3 -Xmx12G -Xms12G -Xmn3G -XX:+CMSEdenChunksRecordAlways -XX:+CMSParallelInitialMarkEnabled -XX:+CMSParallelRemarkEnabled -XX:CMSWaitDuration=10000 -XX:+UseCMSInitiatingOccupancyOnly -XX:+UseCondCardMark 

Most likely, though, there's an issue with your data model, or you're under-provisioned.

Upvotes: 2

Gil Tene
Gil Tene

Reputation: 798

Have you looked at using Zing? Cassandra situations like these are a classic use case, as Zing fundamentally eliminates all GC-related glitches in Cassandra nodes and clusters.

You can see some details on the how/why in my recent "Understanding GC" talk from JavaOne (https://www.slideshare.net/howarddgreen/understanding-gc-javaone-2017). Or just skip to slides 56-60 for Cassandra-specific results.

Upvotes: 3
