Kevin Krieg

Reputation: 11

Tomcat full GC every 2 minutes for first 5 hours of application run time

We have a cluster of Tomcat servers being used on a very high volume website. We've noticed that for the first 5-6 hours after an application reload, a Full GC runs every 2 minutes, pausing the application for anywhere between 5 and 20 seconds. After 5-6 hours the Full GC no longer runs until Tomcat is restarted. Traffic level isn't a factor; we get through our peak hours with no problem.

The servers are all dual quad core with 32GB of RAM running CentOS 5. Our Java opts are being tweaked on a daily basis; here are the settings that correspond to the sample GC log below:

-server -Xmx27g -Xms27g  -XX:+DisableExplicitGC -XX:+UseConcMarkSweepGC -XX:+PrintTenuringDistribution  -Dsun.rmi.dgc.client.gcInterval=900000 -Dsun.rmi.dgc.server.gcInterval=900000 -XX:NewSize=8g -XX:SurvivorRatio=16 -verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails

**Log samples from shortly after application reload**

191.955: [Full GC 191.958: [CMS: 1815877K->1158107K(19922944K), 3.0376720 secs] 3118102K->1158107K(27845568K), [CMS Perm : 83787K->46767K(83960K)], 3.0415080 secs] [Times: user=2.95 sys=0.10, real=3.04 secs]

215.501: [GC 215.504: [ParNew
Desired survivor size 238583808 bytes, new threshold 15 (max 15)
- age   1:   50457968 bytes,   50457968 total
: 7456799K->111048K(7922624K), 0.0617110 secs] 8614906K->1269155K(27845568K), 0.0661400 secs] [Times: user=0.68 sys=0.00, real=0.07 secs]

215.577: [GC 215.579: [ParNew
Desired survivor size 238583808 bytes, new threshold 15 (max 15)
- age   1:      66288 bytes,      66288 total
- age   2:   50219144 bytes,   50285432 total
: 114868K->66525K(7922624K), 0.0381810 secs] 1272975K->1224632K(27845568K), 0.0413630 secs] [Times: user=0.46 sys=0.00, real=0.04 secs]

236.177: [GC 236.180: [ParNew
Desired survivor size 238583808 bytes, new threshold 15 (max 15)
- age   1:   45071064 bytes,   45071064 total
- age   2:      26112 bytes,   45097176 total
- age   3:   34785960 bytes,   79883136 total
: 7523165K->110355K(7922624K), 0.0921350 secs] 8681272K->1268462K(27845568K), 0.0969290 secs] [Times: user=0.95 sys=0.01, real=0.10 secs]

...

316.456: [GC 316.459: [ParNew
Desired survivor size 238583808 bytes, new threshold 15 (max 15)
- age   1:   41430416 bytes,   41430416 total
- age   3:   22728376 bytes,   64158792 total
- age   5:   19599960 bytes,   83758752 total
- age   6:   21847616 bytes,  105606368 total
- age   7:   27667592 bytes,  133273960 total
- age   8:      10904 bytes,  133284864 total
- age   9:   31824256 bytes,  165109120 total
: 7650333K->215213K(7922624K), 0.1332630 secs] 8808440K->1373320K(27845568K), 0.1380590 secs] [Times: user=1.45 sys=0.01, real=0.14 secs]

338.851: [GC 338.854: [ParNew
Desired survivor size 238583808 bytes, new threshold 15 (max 15)
- age   1:   40678840 bytes,   40678840 total
- age   2:   27075936 bytes,   67754776 total
- age   4:   20399720 bytes,   88154496 total
- age   6:   19271008 bytes,  107425504 total
- age   7:   21655032 bytes,  129080536 total
- age   8:   27118800 bytes,  156199336 total
- age   9:      10904 bytes,  156210240 total
- age  10:   31747808 bytes,  187958048 total
: 7671853K->285541K(7922624K), 0.1456470 secs] 8829960K->1443648K(27845568K), 0.1503540 secs] [Times: user=1.62 sys=0.01, real=0.15 secs]

343.376: [Full GC 343.378: [CMS: 1158107K->1312570K(19922944K), 3.4129290 secs] 2884580K->1312570K(27845568K), [CMS Perm : 83964K->47203K(83968K)], 3.4168600 secs] [Times: user=3.87 sys=0.02, real=3.41 secs]

**Last Full GC**

20517.892: [GC 20517.898: [ParNew
Desired survivor size 238583808 bytes, new threshold 15 (max 15)
- age   1:   33948208 bytes,   33948208 total
- age   2:      88280 bytes,   34036488 total
- age   3:   19872472 bytes,   53908960 total
- age   4:   16072608 bytes,   69981568 total
- age   5:   15718712 bytes,   85700280 total
- age   6:   15771016 bytes,  101471296 total
- age   7:   16895976 bytes,  118367272 total
- age   8:   24233728 bytes,  142601000 total
: 7618727K->200950K(7922624K), 0.1728420 secs] 16794482K->9376705K(27845568K), 0.1822350 secs] [Times: user=2.21 sys=0.01, real=0.18 secs]

20526.469: [Full GC 20526.475: [CMS: 9175755K->9210800K(19922944K), 33.1161300 secs] 13632232K->9210800K(27845568K), [CMS Perm : 83967K->53332K(83968K)], 33.1254170 secs] [Times: user=33.12 sys=0.02, real=33.12 secs]


**Log samples after Full GC no longer runs**

74412.335: [GC 74412.340: [ParNew
Desired survivor size 238583808 bytes, new threshold 11 (max 15)
- age   1:   43614032 bytes,   43614032 total
- age   2:   41194144 bytes,   84808176 total
- age   3:   27392888 bytes,  112201064 total
- age   5:   22753896 bytes,  134954960 total
- age   7:   24439608 bytes,  159394568 total
- age   8:   24015704 bytes,  183410272 total
- age   9:   24080848 bytes,  207491120 total
- age  10:   24715800 bytes,  232206920 total
- age  11:   21844024 bytes,  254050944 total
: 7813778K->312911K(7922624K), 0.3329150 secs] 24426351K->16967791K(27845568K), 0.3416730 secs] [Times: user=3.69 sys=0.02, real=0.35 secs]

74445.007: [GC 74445.012: [ParNew
Desired survivor size 238583808 bytes, new threshold 11 (max 15)
- age   1:   42690688 bytes,   42690688 total
- age   2:   37055848 bytes,   79746536 total
- age   3:   37107464 bytes,  116854000 total
- age   4:   26223088 bytes,  143077088 total
- age   6:   22478672 bytes,  165555760 total
- age   8:   24259744 bytes,  189815504 total
- age   9:   23862672 bytes,  213678176 total
- age  10:   23911864 bytes,  237590040 total
- age  11:   24496888 bytes,  262086928 total
: 7769547K->344030K(7922624K), 0.3088470 secs] 24424428K->17021685K(27845568K), 0.3175830 secs] [Times: user=3.57 sys=0.01, real=0.32 secs]

74475.169: [GC 74475.175: [ParNew
Desired survivor size 238583808 bytes, new threshold 10 (max 15)
- age   1:   42011656 bytes,   42011656 total
- age   2:   33147608 bytes,   75159264 total
- age   3:   32391640 bytes,  107550904 total
- age   4:   36516584 bytes,  144067488 total
- age   5:   25940856 bytes,  170008344 total
- age   7:   22037464 bytes,  192045808 total
- age   9:   24130040 bytes,  216175848 total
- age  10:   23724672 bytes,  239900520 total
- age  11:   23329640 bytes,  263230160 total
: 7803184K->331046K(7922624K), 0.3091600 secs] 24480839K->17033619K(27845568K), 0.3179630 secs] [Times: user=3.56 sys=0.01, real=0.32 secs]

If we don't restart this server, it'll run fine for the next week.

Any help would be greatly appreciated.

EDIT: The problem only shows up if the WAR file is re-deployed. Restarting Tomcat on its own does not cause this problem.

Upvotes: 0

Views: 4015

Answers (2)

Sean

Reputation: 7737

A theory on what could be happening here comes from the stats of your Ehcache.

Understanding when the CMS collector kicks off a full GC is important. The following is true [Reference]:

Unlike the other collectors, the CMS collector does not start an old generation collection when the old generation becomes full. Instead, it attempts to start a collection early enough so that it can complete before that happens

Essentially, the CMS collector decides when to run a collection based on previous occupancy levels and how fast the old generation is filling. It does this to reduce pause times in the future.
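If you eventually want to take that timing decision out of the collector's heuristics, one knob worth experimenting with (it is not in the flags you posted, so treat it as an assumption to test rather than a fix) is the CMS initiating occupancy, for example:

-XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly

With those two flags the concurrent cycle begins only once the old generation passes the given percentage, instead of whenever the JVM's own prediction decides to start one.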

So when you see all these full collections early on after the application has started up, the JVM could be determining that memory is being allocated in a flood, and is therefore running frequent GCs to protect itself from reaching the point where an OOM error would occur. If you look at the stats of your GCs, the first full collection starts when only 1.8GB of the roughly 19GB tenured heap is consumed. The last occurs at about 9.2GB of that 19GB.

At this point, when the full GCs stop, the collector has determined it is not under stress and that memory allocation has leveled off. Is it possible that the 5-6 hour mark is when the application's caches have fully populated and are no longer allocating much new memory? You could create a tool that records the hits, misses, and sizes of the caches over time and monitor their growth that way. At the point where they stop growing, you could check whether that corresponds to the time your full GCs stop. Personally I have only used home-grown tools, but you could try something like the EHCache Monitor tool available on their site.
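As a rough illustration, here is a minimal sketch of what such a home-grown monitor could look like, assuming the Ehcache 2.x API; the class name and output format are just placeholders:

import net.sf.ehcache.CacheManager;
import net.sf.ehcache.Ehcache;
import net.sf.ehcache.Statistics;

// Call dump() from a scheduled task inside the webapp (e.g. once a minute)
// and watch for the point where the object counts stop growing.
public class CacheStatsDumper {
    public static void dump() {
        CacheManager manager = CacheManager.getInstance();
        for (String name : manager.getCacheNames()) {
            Ehcache cache = manager.getEhcache(name);
            Statistics stats = cache.getStatistics(); // snapshot of hits/misses/size
            System.out.printf("%s: size=%d hits=%d misses=%d%n",
                    name, stats.getObjectCount(), stats.getCacheHits(), stats.getCacheMisses());
        }
    }
}

If the per-cache sizes flatten out at roughly the same offset into the run as your last full GC, that would support the cache-warm-up theory.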

Also, have you run the GC logs through a tool such as the IBM diagnostic tools or MAT to get a breakdown of the throughput your application gets during that time? With the CMS collector not all pauses are stop-the-world, so some pause times could be shorter than you think.

Upvotes: 0

maximdim

Reputation: 8169

I would try to connect to one of the instances with jvisualvm with the VisualGC plugin installed. It will show you the initial distribution of memory for each pool in the JVM and how it changes over time. There is also a memory sampler and profiler which could be useful for determining the memory state at a given time.
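If attaching a GUI to a production instance isn't practical, a rough command-line equivalent is the jstat tool that ships with the JDK (the pid below is a placeholder):

jstat -gcutil <pid> 5000

That prints the occupancy of each generation as a percentage every 5 seconds, so you can watch the survivor spaces and old generation fill up in real time.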

Also, I'm not sure how you came up with the JVM parameters you're using (27GB? Are you keeping some sort of in-memory cache in-process?), but I would usually start with the bare minimum and tweak it only after a problem has been identified (e.g. too small a New pool). Just try starting it with something like:

-Xmx.. -verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -XX:+PrintTenuringDistribution -Xloggc:/tmp/gc.log -XX:+HeapDumpOnOutOfMemoryError

With your amount of memory, Java should start in 'server' mode automatically, and it's usually pretty good at dynamically sizing the memory pools as long as there is enough memory.

Upvotes: 1
