Reputation: 1175
I'm using CMS for GC, however usage of CPU will be very high after every two months.
There's one piece of GC log when situation got worse, you may find the long STW.
3519696.386: [GC [1 CMS-initial-mark: 8617524K(12582912K)] 17105967K(23907584K), 4.9369140 secs] [Times: user=4.94 sys=0.00, real=4.94 secs]
3519701.324: [CMS-concurrent-mark-start]
3519709.419: [CMS-concurrent-mark: 8.096/8.096 secs] [Times: user=16.17 sys=0.00, real=8.09 secs]
3519709.420: [CMS-concurrent-preclean-start]
3519709.442: [CMS-concurrent-preclean: 0.023/0.023 secs] [Times: user=0.02 sys=0.00, real=0.02 secs]
3519709.442: [CMS-concurrent-abortable-preclean-start]
CMS: abort preclean due to time 3519714.691: [CMS-concurrent-abortable-preclean: 3.345/5.248 secs] [Times: user=3.36 sys=0.00, real=5.25 secs]
3519714.692: [GC[YG occupancy: 8489655 K (11324672 K)]3519714.692: [Rescan (parallel) , 8.4072250 secs]3519723.099: [weak refs processing, 0.0000190 secs]3519723.099: [scrub string table, 0.0008130 secs] [1 CMS-remark: 8617524K(12582912K)] 17107180K(23907584K), 8.4081940 secs] [Times: user=65.71 sys=0.15, real=8.41 secs]
3519723.100: [CMS-concurrent-sweep-start]
3519725.451: [CMS-concurrent-sweep: 2.350/2.350 secs] [Times: user=2.36 sys=0.00, real=2.35 secs]
3519725.451: [CMS-concurrent-reset-start]
3519725.478: [CMS-concurrent-reset: 0.028/0.028 secs] [Times: user=0.03 sys=0.00, real=0.03 secs]
3519727.480: [GC [1 CMS-initial-mark: 8617522K(12582912K)] 17107229K(23907584K), 4.9378950 secs] [Times: user=4.94 sys=0.00, real=4.94 secs]
3519732.418: [CMS-concurrent-mark-start]
My GC parameters:
java -server -Xmx24g -Xms24g -XX:NewSize=12g -XX:MaxNewSize=12g -XX:+HeapDumpOnOutOfMemoryError -XX:MaxDirectMemorySize=24g -XX:+UseConcMarkSweepGC -XX:+UseCMSCompactAtFullCollection -XX:ReservedCodeCacheSize=128m -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:CMSInitiatingOccupancyFraction=68
There are 9 CPUs and 64G memory installed on my server.
Could you help to figure out the key point why it becomes worse monthly?
Upvotes: 1
Views: 655
Reputation: 8379
In your logs, I do not see normal young GC. CMS Stop-the-World phases are designed to piggy back on young space collection to be effective.
3519727.480: [GC [1 CMS-initial-mark: 8617522K(12582912K)] 17107229K(23907584K), 4.9378950 secs] [Times: user=4.94 sys=0.00, real=4.94 secs]
Here CMS-initial-mark
have to scan 8.6 GiB of young space in one thread. If it would be invoked right after young GC young space occupation would be order of magnitude smaller.
Same true for CMS-remark
.
Young have large young space so you end up in situation, when you old space is collected more frequently that young space.
Here are few options to remedy that
-XX:CMSWaitDuration=3600000
let CMS-initial-mark
wait up to one hour until next young GC.-XX:+CMSScavengeBeforeRemark
let CMS-remark
force young collection to produce predictable pause times.CMS-initial-mark
More details are available in this article.
Upvotes: 1
Reputation: 460
Ok so let's see a bit in detail. I'll note first that all times seem to be in user, not in sys, so the main suspects are the JVM and application.
The GC is triggered at old gen. occupancy of 8617524K of a capacity of 12582912K. Total heap usage is 17105967K, size 23907584K.
Initial mark takes ~5s.
3519696.386: [GC [1 CMS-initial-mark: 8617524K(12582912K)] 17105967K(23907584K), 4.9369140 secs] [Times: user=4.94 sys=0.00, real=4.94 secs]
AFAIK the initial mark processes only GC roots. You can see which are these here, but the fact that it's taking so much is odd. My first suspicion would be that this is influenced by time to safepoints, so maybe enable:
-XX:+PrintSafepointStatistics –XX:PrintSafepointStatisticsCount=1
The concurrent mark phase takes 8s
3519709.419: [CMS-concurrent-mark: 8.096/8.096 secs] [Times: user=16.17 sys=0.00, real=8.09 secs]
this is scanning the live objects
Preclean is comparatively fast.
Abortable preclean gets cancelled at 5s, which AFAIK comes is configurable with CMSMaxAbortablePrecleanTime. Digging on this option I see that having minor collections at this stage is desirable, and failure to do so can result in variable large pauses in CMS. Increasing that CMSMaxAbortablePrecleanTime, and activating CMSScavengeBeforeRemark. Check out this post by Jon Masamitsu.
Young generation is at 8G and rescanning took 8s, which seems too much. Again same comment re. safepoints.
3519714.692: [GC[YG occupancy: 8489655 K (11324672 K)]3519714.692: [Rescan (parallel) , 8.4072250 secs]3519723.099: [weak refs processing, 0.0000190 secs]3519723.099: [scrub string table, 0.0008130 secs] [1 CMS-remark: 8617524K(12582912K)] 17107180K(23907584K), 8.4081940 secs] [Times: user=65.71 sys=0.15, real=8.41 secs]
Note that the young gen. size actually increased during this period: 8617524K
The final cleanup at concurrent-sweep takes 2.35 seconds, and the heap contents don't seem to change substantially. You still have roughly the same young and heap usage.
So summarizing I see two points:
Any more details about your app, etc. would definitely help nail it down better. I hope that helps a bit.
Upvotes: 4