Reputation: 4225
We have started having some serious problems with our Java EE application. Specifically, the application fills up to 99% of the old generation heap within minutes of startup. No OOMs are thrown, but the JVM is effectively unresponsive. jstat shows that the old generation does not decrease in size at all, no garbage collection is going on, and kill -3 says:
Heap
PSYoungGen total 682688K, used 506415K [0xc1840000, 0xf3840000, 0xf3840000)
eden space 546176K, 92% used [0xc1840000,0xe06cd020,0xe2da0000)
from space 136512K, 0% used [0xe2da0000,0xe2da0000,0xeb2f0000)
to space 136512K, 0% used [0xeb2f0000,0xeb2f0000,0xf3840000)
PSOldGen total 1536000K, used 1535999K [0x63c40000, 0xc1840000, 0xc1840000)
object space 1536000K, 99% used [0x63c40000,0xc183fff8,0xc1840000)
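(For reference, commands along these lines produce those observations; the exact invocations aren't reproduced here, so treat them as approximate:)
# old generation occupancy sampled every 5 seconds; the O column stays pinned near 100%
jstat -gcutil <pid> 5000
# SIGQUIT: prints a thread dump plus the heap summary shown above
kill -3 <pid>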
The VM options are:
-Xmx2300m -Xms2300m -XX:NewSize=800m -XX:MaxNewSize=800m -XX:SurvivorRatio=4 -XX:PermSize=256m -XX:MaxPermSize=256m -XX:+UseParallelGC -XX:ParallelGCThreads=4
(I changed it from 2300m heap / 1800m new gen as an attempt to resolve the problem.)
I took a heap dump of the JVM once it got to the "out of memory" state (it took forever) and ran Eclipse Memory Analyzer on it.
The results are quite funny. About 200MB is occupied by objects of all kinds (some own more than others), but the rest, 1.9GB, is all unreachable (it may be noteworthy that the majority is occupied by Gson objects, but I don't think that indicates anything; it only says that we churn through a lot of Gson objects during server operation).
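A command along these lines produces such a dump (the file name here is just an example); note that without the live option the dump keeps unreachable objects, which is what lets MAT report them:
jmap -dump:format=b,file=app-heap.hprof <pid>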
Any explanation as to why the VM has so many unreachable objects and is incapable of collecting them at all?
JVM:
$ /0/bin/java -version
java version "1.6.0_37"
Java(TM) SE Runtime Environment (build 1.6.0_37-b06)
Java HotSpot(TM) Server VM (build 20.12-b01, mixed mode)
When the system gets to this stall, here is what verbose GC keeps printing out:
922.485: [GC [1 CMS-initial-mark: 511999K(512000K)] 1952308K(2048000K), 3.9069700 secs] [Times: user=3.91 sys=0.00, real=3.91 secs]
926.392: [CMS-concurrent-mark-start]
927.401: [Full GC 927.401: [CMS927.779: [CMS-concurrent-mark: 1.215/1.386 secs] [Times: user=5.84 sys=0.13, real=1.38 secs] (concurrent mode failure): 511999K->511999K(512000K), 9.4827600 secs] 2047999K->1957315K(2048000K), [CMS Perm : 115315K->115301K(262144K)], 9.4829860 secs] [Times: user=9.78 sys=0.01, real=9.49 secs]
937.746: [Full GC 937.746: [CMS: 512000K->511999K(512000K), 8.8891390 secs] 2047999K->1962252K(2048000K), [CMS Perm : 115302K->115302K(262144K)], 8.8893810 secs] [Times: user=8.89 sys=0.01, real=8.89 secs]
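(Side note on the excerpt above: the CMS-initial-mark / concurrent mode failure lines mean this particular run was using the CMS collector, i.e. -XX:+UseConcMarkSweepGC rather than the -XX:+UseParallelGC listed earlier, presumably because the collector was also changed while experimenting. Output in this format is typically produced by logging flags along these lines, which are not shown above and are therefore an assumption:)
-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps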
SOLVED
As Paul Bellora suggested, this was caused by too many objects being created within the JVM in too short a period of time. Debugging becomes quite tedious at this point. What I ended up doing is instrumenting the classes using a custom JVM agent. The instrumentation would count method and constructor invocations, and the counts were then examined. I found that a single inconspicuous operation would create about 2 million objects and trigger certain individual methods about 1.5 million times (no, there were no loops). The operation itself was identified by being slow compared to the others. You can use any hotspot profiler as well (something like VisualVM), but I had all kinds of trouble with those, so I ended up writing my own.
I still think the behavior of the JVM is a mystery. It looks like the garbage collector comes to a stall and will not clean any more memory, yet the memory allocator expects it to (and thus no OOMs are thrown). Instead, I would have expected it to clear out all that unreachable memory. But the application behavior wouldn't have been much better, as the majority of the time would have been spent garbage collecting anyway.
The agent that I used for help can be found here: https://github.com/veselov/MethodCountAgent. It's far from being a polished piece of software.
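For illustration, the core of such a counting agent can be as small as a static counter registry plus a premain hook. This is only a sketch of the idea, not the actual MethodCountAgent code; the bytecode rewriting that injects the hit(...) calls (e.g. via ASM or Javassist) is omitted, and all names here are made up:

import java.lang.instrument.Instrumentation;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Sketch only: instrumented methods/constructors call CallCounters.hit("com.example.Foo.<init>"),
// and a shutdown hook dumps the totals so the hot allocation paths stand out.
public final class CallCounters {

    private static final ConcurrentHashMap<String, AtomicLong> COUNTS =
            new ConcurrentHashMap<String, AtomicLong>();

    public static void hit(String methodId) {
        AtomicLong counter = COUNTS.get(methodId);
        if (counter == null) {
            AtomicLong fresh = new AtomicLong();
            AtomicLong existing = COUNTS.putIfAbsent(methodId, fresh);
            counter = (existing != null) ? existing : fresh;
        }
        counter.incrementAndGet();
    }

    public static void premain(String agentArgs, Instrumentation inst) {
        // inst.addTransformer(...);  // the transformer that injects hit(...) calls is omitted
        Runtime.getRuntime().addShutdownHook(new Thread() {
            @Override
            public void run() {
                for (Map.Entry<String, AtomicLong> e : COUNTS.entrySet()) {
                    System.out.println(e.getValue().get() + "\t" + e.getKey());
                }
            }
        });
    }
}

Packaged with a Premain-Class entry in the jar manifest, it would be attached with -javaagent:counters.jar.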
Upvotes: 26
Views: 22361
Reputation: 40256
Based on the stats you listed, I find it hard to believe you have 1.9G of unreachable data. It looks more like the GC overhead limit being reached.
Consider
937.746: [Full GC 937.746: [CMS: 512000K->511999K(512000K), 8.8891390 secs] 2047999K->1962252K(2048000K), [CMS Perm : 115302K->115302K(262144K)], 8.8893810 secs] [Times: user=8.89 sys=0.01, real=8.89 secs]
If this is true, then a Full GC releases only about 85 MB of data (2047999K -> 1962252K). If you really had 1.9G of unreachable data, you would see something like 2047999 -> ~300000.
Also
object space 1536000K, 99%
implies something was created and stored in such a way that it escaped a method and is now probably living forever.
I would need to see more evidence that you have 1.9G of unreachable data other than simply being told.
Upvotes: 2
Reputation: 55223
Any explanation as to why the VM has so many unreachable objects and is incapable of collecting them at all?
(Based on our exchange in the comments) it sounds like this is not a traditional memory leak but some piece of logic that is continuously spamming new objects such that the GC struggles to keep up under the current architecture.
The culprit could be, for example, some API request that is being made many, many times, or else is "stuck" in some erroneous state like the infinite pagination scenario I described. What either situation boils down to is millions of response Gson objects (which point to Strings (which point to char[]s)) being instantiated and then becoming eligible for GC.
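Something along these lines, purely hypothetical (the PageResponse type, the JSON, and the loop count are made up for illustration), is all it takes; the point is only that every fromJson call builds a fresh object graph that immediately becomes garbage:

import com.google.gson.Gson;

// Illustration only: each fromJson call allocates a new response object, new Strings,
// and new char[] backing arrays. Harmless at normal request rates, pathological when a
// single stuck operation does it millions of times.
public class GsonChurnExample {

    static class PageResponse {
        String cursor;
        String[] items;
        boolean hasMore;
    }

    public static void main(String[] args) {
        Gson gson = new Gson();
        String json = "{\"cursor\":\"abc\",\"items\":[\"a\",\"b\"],\"hasMore\":true}";
        for (int i = 0; i < 2000000; i++) {   // stands in for the "stuck" pagination loop
            PageResponse resp = gson.fromJson(json, PageResponse.class);
            // resp and everything it points to become unreachable here, on every iteration
        }
    }
}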
As I said, you should try to isolate the problem request(s), then debug and take measurements to decide whether this is a bug or a scalability issue on the part of your application or one of its libraries.
Upvotes: 15