Junjie

Reputation: 1175

Adjust parameters for GC (CMS)

I'm using CMS for GC; however, CPU usage becomes very high roughly every two months.

Here is a piece of the GC log from when the situation got worse; you can see the long STW pauses.

3519696.386: [GC [1 CMS-initial-mark: 8617524K(12582912K)] 17105967K(23907584K), 4.9369140 secs] [Times: user=4.94 sys=0.00, real=4.94 secs]
3519701.324: [CMS-concurrent-mark-start]
3519709.419: [CMS-concurrent-mark: 8.096/8.096 secs] [Times: user=16.17 sys=0.00, real=8.09 secs]
3519709.420: [CMS-concurrent-preclean-start]
3519709.442: [CMS-concurrent-preclean: 0.023/0.023 secs] [Times: user=0.02 sys=0.00, real=0.02 secs]
3519709.442: [CMS-concurrent-abortable-preclean-start]
 CMS: abort preclean due to time 3519714.691: [CMS-concurrent-abortable-preclean: 3.345/5.248 secs] [Times: user=3.36 sys=0.00, real=5.25 secs]
3519714.692: [GC[YG occupancy: 8489655 K (11324672 K)]3519714.692: [Rescan (parallel) , 8.4072250 secs]3519723.099: [weak refs processing, 0.0000190 secs]3519723.099: [scrub string table, 0.0008130 secs] [1 CMS-remark: 8617524K(12582912K)] 17107180K(23907584K), 8.4081940 secs] [Times: user=65.71 sys=0.15, real=8.41 secs]
3519723.100: [CMS-concurrent-sweep-start]
3519725.451: [CMS-concurrent-sweep: 2.350/2.350 secs] [Times: user=2.36 sys=0.00, real=2.35 secs]
3519725.451: [CMS-concurrent-reset-start]
3519725.478: [CMS-concurrent-reset: 0.028/0.028 secs] [Times: user=0.03 sys=0.00, real=0.03 secs]
3519727.480: [GC [1 CMS-initial-mark: 8617522K(12582912K)] 17107229K(23907584K), 4.9378950 secs] [Times: user=4.94 sys=0.00, real=4.94 secs]
3519732.418: [CMS-concurrent-mark-start]

My GC parameters:

java -server -Xmx24g -Xms24g -XX:NewSize=12g -XX:MaxNewSize=12g -XX:+HeapDumpOnOutOfMemoryError -XX:MaxDirectMemorySize=24g -XX:+UseConcMarkSweepGC -XX:+UseCMSCompactAtFullCollection -XX:ReservedCodeCacheSize=128m  -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:CMSInitiatingOccupancyFraction=68

There are 9 CPUs and 64 GB of memory installed on my server.

Could you help me figure out why it gets worse every couple of months?

Upvotes: 1

Views: 655

Answers (2)

Alexey Ragozin

Reputation: 8379

In your logs I do not see a normal young GC. The CMS stop-the-world phases are designed to piggyback on young space collections to be effective.

3519727.480: [GC [1 CMS-initial-mark: 8617522K(12582912K)] 17107229K(23907584K), 4.9378950 secs] [Times: user=4.94 sys=0.00, real=4.94 secs]

Here CMS-initial-mark has to scan 8.6 GiB of young space in a single thread. If it were invoked right after a young GC, the young space occupancy would be an order of magnitude smaller.

The same is true for CMS-remark.

You have a large young space, so you end up in a situation where your old space is collected more frequently than your young space.

Here are a few options to remedy that (a command-line sketch follows the list):

  • -XX:CMSWaitDuration=3600000 lets CMS-initial-mark wait up to one hour for the next young GC.
  • -XX:+CMSScavengeBeforeRemark lets CMS-remark force a young collection, producing predictable pause times.
  • Use the latest Java 7/8, which has a parallel CMS-initial-mark.
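
For illustration only, this is a sketch of how those options might be folded into your existing command line. The one-hour CMSWaitDuration comes straight from the bullet above (the flag takes milliseconds), while -XX:+CMSParallelInitialMarkEnabled is my assumption for the flag that turns on the parallel initial mark in recent Java 7/8 builds, so verify it against your JVM version:

java -server -Xmx24g -Xms24g -XX:NewSize=12g -XX:MaxNewSize=12g \
     -XX:MaxDirectMemorySize=24g -XX:+HeapDumpOnOutOfMemoryError \
     -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=68 \
     -XX:+UseCMSCompactAtFullCollection -XX:ReservedCodeCacheSize=128m \
     -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
     -XX:CMSWaitDuration=3600000 \
     -XX:+CMSScavengeBeforeRemark \
     -XX:+CMSParallelInitialMarkEnabled \
     <your main class>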

More details are available in this article.

Upvotes: 1

Galo Navarro

Reputation: 460

OK, so let's look at this in a bit of detail. I'll note first that all the time seems to be spent in user, not in sys, so the main suspects are the JVM and the application.

The GC is triggered at an old gen occupancy of 8617524K out of a capacity of 12582912K. Total heap usage is 17105967K out of a 23907584K heap.

Initial mark takes ~5s.

3519696.386: [GC [1 CMS-initial-mark: 8617524K(12582912K)] 17105967K(23907584K), 4.9369140 secs] [Times: user=4.94 sys=0.00, real=4.94 secs]

AFAIK the initial mark processes only the GC roots. You can see which these are here, but the fact that it's taking so long is odd. My first suspicion would be that this is influenced by time to safepoint, so maybe enable:

-XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1

The concurrent mark phase takes 8s

3519709.419: [CMS-concurrent-mark: 8.096/8.096 secs] [Times: user=16.17 sys=0.00, real=8.09 secs]

This is scanning the live objects.

Preclean is comparatively fast.

Abortable preclean gets aborted at 5s, which AFAIK is configurable with CMSMaxAbortablePrecleanTime. Digging into this option, I see that having minor collections during this stage is desirable, and failing to get them can result in variable, large pauses in CMS. Consider increasing CMSMaxAbortablePrecleanTime and activating CMSScavengeBeforeRemark. Check out this post by Jon Masamitsu.
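
As a rough sketch only, those two suggestions would translate into flags along these lines. The 30000 ms value is an arbitrary placeholder (the ~5 s abort in your log matches the default limit of 5000 ms), so tune it against how often your young collections actually run:

-XX:CMSMaxAbortablePrecleanTime=30000 -XX:+CMSScavengeBeforeRemark

The idea is to give the abortable preclean enough time to catch a young collection, and to force one right before the remark if it does not.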

The young generation is at ~8G and rescanning took 8s, which seems like too much. Again, the same comment applies regarding safepoints.

3519714.692: [GC[YG occupancy: 8489655 K (11324672 K)]3519714.692: [Rescan (parallel) , 8.4072250 secs]3519723.099: [weak refs processing, 0.0000190 secs]3519723.099: [scrub string table, 0.0008130 secs] [1 CMS-remark: 8617524K(12582912K)] 17107180K(23907584K), 8.4081940 secs] [Times: user=65.71 sys=0.15, real=8.41 secs]

Note that the young gen occupancy actually increased slightly during this period (YG occupancy: 8489655K).

The final cleanup at concurrent-sweep takes 2.35 seconds, and the heap contents don't seem to change substantially: you still have roughly the same young gen and total heap usage.

So, to summarize, I see two points:

  • Your heap is large, you're reaching the CMSInitiatingOccupancyFraction and triggering CMS, and a lot of time seems to be spent scanning live objects. Either way, check the safepoint times to see whether this could be improved.
  • GC doesn't really collect much so you're probably in one of these situations:
    • You want to keep lots of long-lived objects (e.g. a cache). In this case, you want to increase the CMSInitiatingOccupancyFraction (as you expect the old gen to get very full). But also watch that you do not promote any mid- or short-lived objects, because those will eventually (in a day, or in 2 months) lead to long GCs. That is: avoid churn in the old gen.
    • You're generating lots of short- to mid-lived objects, and you need to avoid promotions: decrease allocations and/or increase Eden (see the flag sketch after this list).
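
Purely as an illustration, the knobs in those two bullets map to standard HotSpot flags like the ones below; the concrete values are placeholders to be tuned against your own allocation profile, not recommendations:

-XX:CMSInitiatingOccupancyFraction=80 -XX:+UseCMSInitiatingOccupancyOnly (start CMS later, and only at that threshold; the cache-heavy case)
-XX:NewSize=14g -XX:MaxNewSize=14g -XX:SurvivorRatio=8 (a larger young generation so fewer short- and mid-lived objects are promoted; the allocation-heavy case)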

Any more details about your app, etc. would definitely help nail it down better. I hope that helps a bit.

Upvotes: 4
