Tom

Reputation: 3176

Garbage collection tuning a production application

I've been tasked with tuning a production application that consists of a Spring MVC REST interface serving large (~0 MB - 100 MB) JSON documents from a GemFire in-memory cache backend. The application runs on a CentOS server inside Tomcat 7 on JDK 1.6. We realized that the application needed to be tuned because we were seeing frequent stop-the-world old-generation garbage collections, which would eventually lead to java.lang.OutOfMemoryError: GC overhead limit exceeded errors if left unattended.

Through some trial and error and monitoring I've managed to tune the application with these parameters:

-Xms20g 
-Xmx20g 

-XX:PermSize=256m 
-XX:MaxPermSize=256m

-XX:NewSize=8g 
-XX:MaxNewSize=8g

-XX:SurvivorRatio=8
-XX:+DisableExplicitGC
-XX:+UseConcMarkSweepGC 
-XX:+UseParNewGC
-XX:CMSInitiatingOccupancyFraction=70
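
To observe the resulting behaviour, GC logging can be enabled alongside the flags above (a minimal set of standard JDK 1.6 HotSpot logging flags; the log path is just an example):

```
-verbose:gc
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintTenuringDistribution
-Xloggc:/path/to/gc.log
```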

The garbage collection behavior that I'm seeing now (48 hours under heavy test load) is that an eden-space collection happens about once every 10 seconds and lasts about 0.04 seconds. The old generation is not growing at all after 48 hours and there have been 0 collections in that space.
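
That works out to roughly 0.4% of wall-clock time spent in young-generation pauses. A trivial check of the arithmetic (numbers taken from the observations above):

```java
public class GcOverhead {
    public static void main(String[] args) {
        double pauseSeconds = 0.04;    // observed eden-collection pause
        double intervalSeconds = 10.0; // observed interval between collections
        double overheadPercent = pauseSeconds / intervalSeconds * 100.0;
        System.out.println("Young GC overhead: ~" + overheadPercent + "% of wall-clock time");
    }
}
```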

My question is should I be concerned about not having the old generation garbage collected? Overall does this look like a healthy tuning?

Edit: For anyone who cares, my GC log is available here: http://filebin.ca/2U8awo1KTS1D/udf-gc.log.0

Upvotes: 2

Views: 711

Answers (2)

Ivo

Reputation: 444

Tuning garbage collection is no different from generic performance tuning in the sense that, without requirements in place, you can (for non-trivial applications at least) effectively keep improving forever. At some point the improvements no longer matter for the practical use case. That is why you should have goals in place.

The goals regarding GC should be derived from the generic performance requirements. These in turn usually describe three dimensions:

  • Latency. Or more precisely, an acceptable latency distribution per service published by the application. For example: 99% of login() operations must complete under 500 ms and the worst case cannot exceed 2,500 ms.
  • Throughput. How many operations per time unit must be completed. This is tougher to measure for large monoliths, but if you are running microservices you can express it as, for example, "1,000 login operations processed per second".
  • Capacity. Adding more resources and scaling out will improve the situation, but in practice things such as the monthly AWS bill will set limits in this regard.
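
As an illustrative sketch of checking a latency goal like the first one above (the login() operation name, the sample values, and the nearest-rank percentile approach are all assumptions made up for the example):

```java
import java.util.Arrays;

public class LatencyGoal {
    // Nearest-rank percentile of a set of latency samples, in milliseconds.
    static long percentile(long[] samplesMs, double pct) {
        long[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        int idx = (int) Math.ceil(pct / 100.0 * sorted.length) - 1;
        return sorted[Math.max(idx, 0)];
    }

    public static void main(String[] args) {
        // Made-up latency samples for a login() operation, in milliseconds.
        long[] samples = {120, 95, 310, 480, 450, 150, 200, 90, 410, 130};
        System.out.println("p99 = " + percentile(samples, 99.0) + " ms");
        System.out.println("worst = " + percentile(samples, 100.0) + " ms");
    }
}
```

With these ten samples p99 comes out as 480 ms, inside the 500 ms goal; a real measurement would of course need far more samples for a p99 figure to mean anything.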

Having these requirements in place, you can start deriving your GC goals from them and, if necessary, optimizing further. The company I am affiliated with recently published a rather thorough handbook about GC tuning, so you can read more in the GC tuning sections of the handbook.

Upvotes: 0

the8472

Reputation: 43160

My question is should I be concerned about not having the old generation garbage collected?

The logs look fine. Given the trends, old-gen occupancy grows very slowly, so it will take several days until it becomes full enough for a concurrent marking cycle to be initiated.

Overall does this look like a healthy tuning?

It seems like you're giving it much more memory than it needs.

Old-gen occupancy is around 2G out of 12G. This means you could probably shrink it to 4G and still go many hours before a concurrent cycle gets started: with CMSInitiatingOccupancyFraction=70, a 4G old generation would not start a cycle until occupancy reached about 2.8G, well above the ~2G observed after 48 hours.

Most young objects only live to age 1 (out of 15) in the young generation. This means the young generation could also be shrunk without increasing object promotion too much.

-XX:CMSInitiatingOccupancyFraction=70

That should be combined with -XX:+UseCMSInitiatingOccupancyOnly.
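
Without that flag, CMS treats the configured fraction only as a starting hint and then lets its own heuristics decide when to begin a concurrent cycle. Combined, the relevant flags would look like this (the occupancy value is the one from the question):

```
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=70
-XX:+UseCMSInitiatingOccupancyOnly
```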

Upvotes: 3
