Mircea Vutcovici

Reputation: 2374

Why this particular minor GC was so slow

We have a Jira 6.3.10 application running on Tomcat 6.0.35 and JDK 7u60. The OS is CentOS 6.2, with kernel 2.6.32-220.7.1.el6.x86_64. From time to time we notice pauses in the application, and we found a correlation with GC activity.

The startup options for memory are: -Xms16384m -Xmx16384m -XX:NewSize=6144m -XX:MaxPermSize=512m -XX:+UseCodeCacheFlushing -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled -XX:ReservedCodeCacheSize=512m -XX:+DisableExplicitGC

The problem is that I cannot explain why one particular minor GC took almost 7 seconds. See the collection at 2015-09-17T14:59:42.485+0000. Users noticed a pause of a bit over one minute.

From what I could read at http://blog.ragozin.info/2011/06/understanding-gc-pauses-in-jvm-hotspots.html, I think Tcard_scan (card table scanning) is what made this collection slow, but I am not sure and I cannot explain why that would happen.

2015-09-17T14:57:03.824+0000: 3160725.220: [GC2015-09-17T14:57:03.824+0000: 3160725.220: [ParNew: 5112700K->77061K(5662336K), 0.0999740 secs] 10048034K->5017436K(16148096K), 0.1002730 secs] [Times: user=1.01 sys=0.02, real=0.10 secs]
2015-09-17T14:57:36.631+0000: 3160758.027: [GC2015-09-17T14:57:36.631+0000: 3160758.027: [ParNew: 5110277K->127181K(5662336K), 0.0841060 secs] 10050652K->5075489K(16148096K), 0.0843680 secs] [Times: user=0.87 sys=0.02, real=0.09 secs]
2015-09-17T14:57:59.494+0000: 3160780.890: [GC2015-09-17T14:57:59.494+0000: 3160780.890: [ParNew: 5160397K->104929K(5662336K), 0.1033160 secs] 10108705K->5056258K(16148096K), 0.1036150 secs] [Times: user=0.62 sys=0.00, real=0.11 secs]
2015-09-17T14:58:27.069+0000: 3160808.464: [GC2015-09-17T14:58:27.069+0000: 3160808.465: [ParNew: 5138145K->86797K(5662336K), 0.0844890 secs] 10089474K->5039063K(16148096K), 0.0847790 secs] [Times: user=0.68 sys=0.01, real=0.09 secs]
2015-09-17T14:58:43.489+0000: 3160824.885: [GC2015-09-17T14:58:43.489+0000: 3160824.885: [ParNew: 5120013K->91000K(5662336K), 0.0588270 secs] 10072279K->5045124K(16148096K), 0.0590680 secs] [Times: user=0.53 sys=0.01, real=0.06 secs]
2015-09-17T14:59:03.831+0000: 3160845.227: [GC2015-09-17T14:59:03.832+0000: 3160845.227: [ParNew: 5124216K->89921K(5662336K), 0.1018980 secs] 10078340K->5047918K(16148096K), 0.1021850 secs] [Times: user=0.56 sys=0.01, real=0.10 secs]
2015-09-17T14:59:42.485+0000: 3160883.880: [GC2015-09-17T14:59:42.485+0000: 3160883.880: [ParNew: 5123137K->98539K(5662336K), 6.9674580 secs] 10081134K->5061766K(16148096K), 6.9677100 secs] [Times: user=102.14 sys=0.05, real=6.97 secs]
2015-09-17T15:00:17.679+0000: 3160919.075: [GC2015-09-17T15:00:17.680+0000: 3160919.075: [ParNew: 5131755K->141258K(5662336K), 0.1189970 secs] 10094982K->5107030K(16148096K), 0.1194650 secs] [Times: user=0.80 sys=0.00, real=0.12 secs]
2015-09-17T15:01:20.149+0000: 3160981.545: [GC2015-09-17T15:01:20.149+0000: 3160981.545: [ParNew: 5174474K->118871K(5662336K), 0.1251710 secs] 10140246K->5089067K(16148096K), 0.1255370 secs] [Times: user=0.63 sys=0.00, real=0.12 secs]
2015-09-17T15:03:07.323+0000: 3161088.718: [GC2015-09-17T15:03:07.323+0000: 3161088.719: [ParNew: 5152087K->80387K(5662336K), 0.0782410 secs] 10122283K->5055601K(16148096K), 0.0785610 secs] [Times: user=0.57 sys=0.01, real=0.07 secs]
2015-09-17T15:03:26.396+0000: 3161107.791: [GC2015-09-17T15:03:26.396+0000: 3161107.791: [ParNew: 5113538K->66134K(5662336K), 0.0697170 secs] 10088753K->5044322K(16148096K), 0.0699990 secs] [Times: user=0.48 sys=0.01, real=0.07 secs]
2015-09-17T15:03:47.185+0000: 3161128.580: [GC2015-09-17T15:03:47.185+0000: 3161128.581: [ParNew: 5099350K->62874K(5662336K), 0.0692700 secs] 10077538K->5043879K(16148096K), 0.0695140 secs] [Times: user=0.61 sys=0.02, real=0.07 secs]
2015-09-17T15:04:04.503+0000: 3161145.899: [GC2015-09-17T15:04:04.503+0000: 3161145.899: [ParNew: 5096090K->63684K(5662336K), 0.0709490 secs] 10077095K->5047678K(16148096K), 0.0712390 secs] [Times: user=0.54 sys=0.01, real=0.07 secs]
2015-09-17T15:04:48.013+0000: 3161189.409: [GC2015-09-17T15:04:48.013+0000: 3161189.409: [ParNew: 5096900K->74854K(5662336K), 0.1530160 secs] 10080894K->5061766K(16148096K), 0.1533520 secs] [Times: user=0.76 sys=0.00, real=0.15 secs] 
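
Comparing the CPU-time/wall-time ratio of the slow collection with a typical one (rough arithmetic on the numbers above, nothing more):

    slow collection:   user 102.14 s / real 6.97 s  ->  roughly 14.7 cores busy, ~102 CPU-seconds of GC work
    normal collection: user   1.01 s / real 0.10 s  ->  roughly 10 cores busy,   ~1 CPU-second of GC work

So during the slow pause the GC threads were actually consuming (or contending for) about 100x more CPU than in the neighbouring minor collections, rather than the JVM simply sitting idle.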

We have 198 GB of RAM. The server is not swapping actively. This particular Jira instance has pretty heavy usage; automated tools poll it constantly. Memory status on the server:

$ cat /proc/meminfo
MemTotal:       198333224 kB
MemFree:        13194296 kB
Buffers:           93948 kB
Cached:         10236288 kB
SwapCached:      1722248 kB
Active:         168906584 kB
Inactive:       10675040 kB
Active(anon):   163755088 kB
Inactive(anon):  5508552 kB
Active(file):    5151496 kB
Inactive(file):  5166488 kB
Unevictable:        4960 kB
Mlocked:            4960 kB
SwapTotal:       6193136 kB
SwapFree:             12 kB
Dirty:             14040 kB
Writeback:             0 kB
AnonPages:      167534556 kB
Mapped:          1341076 kB
Shmem:              9196 kB
Slab:            2117816 kB
SReclaimable:    1258104 kB
SUnreclaim:       859712 kB
KernelStack:       51048 kB
PageTables:       431780 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    105359748 kB
Committed_AS:   187566824 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      680016 kB
VmallocChunk:   34255555544 kB
HardwareCorrupted:     0 kB
AnonHugePages:  79947776 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:        5604 kB
DirectMap2M:     2078720 kB
DirectMap1G:    199229440 kB

The other Jira instances running on the same machine were unaffected. We are running 30 Jira instances on this machine.

Upvotes: 1

Views: 1567

Answers (1)

Michael

Reputation: 7438

In General

It's probably another process taking resources (or swapping increasing overhead):

Sometimes the OS activities such as the swap space or networking activity happening at the time when GC is taking place can make the GC pauses last much longer. These pauses can be of the order of few seconds to some minutes.

If your system is configured to use swap space, Operating System may move inactive pages of memory of the JVM process to the swap space, to free up memory for the currently active process which may be the same process or a different process on the system. Swapping is very expensive as it requires disk accesses which are much slower as compared to the physical memory access. So, if during a garbage collection the system needs to perform swapping, the GC would seem to run for a very long time.

Source: https://blogs.oracle.com/poonam/entry/troubleshooting_long_gc_pauses

The article also provides some general pointers that may be of interest.
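
If you want to confirm or rule out swap activity around the pause window, something like the following would show it (assuming the sysstat package is installed for sar); non-zero si/so or pswpin/pswpout values during the pause would point at swapping:

$ vmstat 1                          # live view; si/so columns are pages swapped in/out per second
$ sar -W -s 14:55:00 -e 15:05:00    # historical swapping stats, if sysstat was collecting at the time
$ sar -B -s 14:55:00 -e 15:05:00    # historical paging stats for the same window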


Specifically

However, in this case I'd say the issue is more likely to be the number of JVMs running on your server.

30 distinct heaps, plus per-JVM overhead, will probably result in significant memory usage. If the total heap allocation is more than 75% of physical memory, you're likely to hit the issue described above (swap activity would reveal this).
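
Plugging in the numbers from your question as a rough check:

    75% of 198 GB physical RAM           ≈ 148 GB
    Committed_AS from your meminfo       ≈ 179 GB
    SwapFree                             =  12 kB (out of ~6 GB of swap)

So the box is already committed well past that threshold and swap is essentially exhausted, even if pages aren't being moved at the exact moment you look.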

More importantly, thread CPU contention is likely to be the real killer here. As a rule of thumb, I'd have no more JVMs than the number of logical CPUs (if the instances are going to be used concurrently).

Also, I try to ensure the number of GC threads doesn't exceed 4 times the number of logical CPU cores available.

At 30 * 18 GC threads per JVM (based on your comment), that's potentially 540 GC threads. To stay within the 4x rule of thumb you'd need 135 logical cores, which I'm assuming your system doesn't have?

If my requirements for responsiveness are lower, I'd potentially go to a ratio of 8 to 1. But that's as far as I'd go on a single system; separate JVMs compete against each other for resources.
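
To see what each JVM instance actually picked (the defaults are derived from the number of CPUs the JVM sees), you can dump the final flag values and compare against the core count, e.g.:

$ java -XX:+PrintFlagsFinal -version | egrep 'ParallelGCThreads|ConcGCThreads'
$ nproc    # number of logical CPUs the OS reports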


Suggestion

Decrease the number of ParallelGCThreads, with the aim of getting the total below the 4x threshold. If that's not practical, set the low-priority JVMs to a low value (2) and the higher-priority JVMs appropriately (4 or 8, possibly).

Also, I'm not sure of the implications of setting ConcGCThreads=0, as I'd assume you can't have CMS without any concurrent worker threads. When the value is not set, the JVM decides for itself based on the system; I'd expect the same behaviour for a 0 setting, which would probably produce a value far too high on a shared system. Try setting it explicitly.
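
As an illustration only (starting-point values, not a recommendation tuned to your workload), the relevant flags would look like:

    # low-priority instances
    -XX:ParallelGCThreads=2 -XX:ConcGCThreads=1
    # higher-priority instances
    -XX:ParallelGCThreads=8 -XX:ConcGCThreads=2

For CMS, when ConcGCThreads is left unset the JVM derives it from ParallelGCThreads (roughly (ParallelGCThreads + 3) / 4), so lowering ParallelGCThreads also shrinks the concurrent phase's thread count.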

Upvotes: 3
