Reputation: 31918
I've got (the currently latest) JDK 1.6.0_18 crashing unexpectedly while running a web application on (the currently latest) Tomcat 6.0.24, after anywhere from 4 hours to 8 days of stress testing (30 threads hitting the app at 6 mil. pageviews/day). This is on RHEL 5.2 (Tikanga).
The crash report is at http://pastebin.com/f639a6cf1 and the consistent parts of the crash are:
JVM runs with the following options:
CATALINA_OPTS="-server -Xms512m -Xmx1024m -Djava.awt.headless=true"
I've also tested the memory for hardware problems using http://memtest.org/ for 48 hours (14 passes of the whole memory) without any error.
I've enabled -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
to check for any GC trends or space exhaustion, but there is nothing suspicious there. Minor and full GCs happen at predictable intervals, almost always freeing roughly the same amount of memory.
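For reference, a sketch of how those flags might sit alongside the existing options; the -Xloggc flag and its file path are only an illustration here, not part of the original setup:

# GC logging added to the existing options (log path is hypothetical)
CATALINA_OPTS="-server -Xms512m -Xmx1024m -Djava.awt.headless=true \
 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
 -Xloggc:/var/log/tomcat/gc.log"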
My application does not, directly, use any native code.
Any ideas of where I should look next?
Edit - more info:
1) There is no client vm in this JDK:
[foo@localhost ~]$ java -version -server
java version "1.6.0_18"
Java(TM) SE Runtime Environment (build 1.6.0_18-b07)
Java HotSpot(TM) 64-Bit Server VM (build 16.0-b13, mixed mode)
[foo@localhost ~]$ java -version -client
java version "1.6.0_18"
Java(TM) SE Runtime Environment (build 1.6.0_18-b07)
Java HotSpot(TM) 64-Bit Server VM (build 16.0-b13, mixed mode)
2) Changing the O/S is not possible.
3) I don't want to change the JMeter stress test variables since this could hide the problem. Since I've got a use case (the current stress test scenario) which crashes the JVM I'd like to fix the crash and not change the test.
4) I've done static analysis on my application but nothing serious came up.
5) The memory does not grow over time. The memory usage equilibrates very quickly (after startup) at a very steady trend which does not seem suspicious.
/var/log/messages does not contain any useful information before or during the time of the crash.
More info: I forgot to mention that there was an Apache (2.2.14) fronting Tomcat via mod_jk 1.2.28. Right now I'm running the test without Apache, just in case the JVM crash is related to the mod_jk native code that connects to the JVM (the Tomcat connector).
After that (if the JVM crashes again) I'll try removing some components from my application (caching, Lucene, Quartz) and later on will try using Jetty. Since the crash currently happens anywhere between 4 hours and 8 days in, it may take a long time to find out what's going on.
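For anyone reproducing this, taking Apache/mod_jk out of the picture essentially means pointing JMeter straight at Tomcat's plain HTTP connector in server.xml (the ports below are just the Tomcat defaults, not necessarily what this setup uses):

<!-- hit this directly from JMeter instead of going through Apache/mod_jk -->
<Connector port="8080" protocol="HTTP/1.1" connectionTimeout="20000" />

<!-- the AJP connector used by mod_jk can be commented out while testing
<Connector port="8009" protocol="AJP/1.3" redirectPort="8443" />
-->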
Upvotes: 10
Views: 2894
Reputation: 8476
Do you have compiler output? i.e. -XX:+PrintCompilation (and if you're feeling particularly brave, -XX:+LogCompilation).
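If you don't have it yet, the flags look roughly like this; the log file path is just an example, and LogCompilation needs the diagnostic options unlocked first:

# print each JIT compilation to stdout
CATALINA_OPTS="$CATALINA_OPTS -XX:+PrintCompilation"
# LogCompilation is a diagnostic option, so unlock it first (log path is illustrative)
CATALINA_OPTS="$CATALINA_OPTS -XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation -XX:LogFile=/tmp/hotspot-compile.log"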
I have debugged a case like this in the past by watching what the compiler was doing and, eventually (it took a long time until the light bulb moment), realising that my crash was caused by the compilation of a particular method in the Oracle JDBC driver.
Basically what I'd do is:
If there is a discernible pattern, then use .hotspot_compiler (or .hotspotrc) to make it stop compiling the offending method(s), repeat the test and see whether it still blows up (the file format is sketched below). Obviously in your case this process could theoretically take months, I'm afraid.
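For the record, .hotspot_compiler lives in the JVM's working directory and just contains exclude directives, one per line; the class and method here are placeholders, not anything from the actual crash:

exclude com/example/dao/CustomerDao buildQuery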
Some references:
The other thing I'd do is systematically change the GC algorithm you're using and check the crash times against GC activity (e.g. does it correlate with a young or old GC? what about TLABs?). Your dump indicates you're using parallel scavenge, so try one of the other collectors (a few common options are sketched below).
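A few collectors worth trying one at a time on a 1.6 VM; these are the usual suspects, not necessarily an exhaustive list:

-XX:+UseSerialGC          (single-threaded collector)
-XX:+UseConcMarkSweepGC   (concurrent mark-sweep for the old generation)
-XX:+UseParallelOldGC     (parallel scavenge for the young gen plus parallel old-gen compaction)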
If it doesn't recur with the different GC algos, then you know it's down to that (and you have no fix but to change the GC algo and/or walk back through older JVMs until you find a version of that algo that doesn't blow up).
Upvotes: 4
Reputation: 75406
Is it an option to go to the 32-bit JVM instead? I believe it is the most mature offering from Sun.
Upvotes: 1
Reputation: 1398
Have you tried different hardware? It looks like you're using a 64-bit architecture. In my own experience 32-bit is faster and more stable. Perhaps there's a hardware issue somewhere too. A timing of "between 4-24 hours" is quite spread out to be just a software issue. Although you do say the system log has no errors, so I could be way off. I still think it's worth a try.
Upvotes: 2
Reputation: 26733
If I were you, I'd do the following:
Let us know how this was sorted out!
Upvotes: 1
Reputation: 8534
Try switching your servlet container from Tomcat to Jetty (http://jetty.codehaus.org/jetty/).
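If you go that route, a quick way to smoke-test the same WAR under a stock Jetty distribution is roughly the following; the paths are illustrative and the exact steps depend on the Jetty version:

cd /path/to/jetty
cp /path/to/mywebapp.war webapps/
java -jar start.jar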
Upvotes: 1
Reputation: 4141
Does your memory grow over time? If so, I suggest lowering the memory limits to see if the system fails more frequently when memory is exhausted (see the sketch below).
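For example, something along these lines; the numbers are arbitrary, just noticeably lower than the current 512m/1024m:

# deliberately small heap to force memory pressure sooner (values are arbitrary)
CATALINA_OPTS="-server -Xms128m -Xmx256m -Djava.awt.headless=true"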
Can you reproduce the problem faster if:
One of the main strategies that I have used is to determine which use case is causing the problem. It might be a generic issue, or it might be use-case specific. Try logging the start and stop of each use case to see which ones are more likely to cause the problem. If you partition your use cases in half, see which half fails faster; that half is the more likely source of the failure. Naturally, running a few trials of each configuration will increase the accuracy of your measurements.
I have also been known to either change the server to do little work or loop on the work that the server is doing. One makes your application code work a lot harder, the other makes the web server and application server work a lot harder.
Good luck, Jacob
Upvotes: 1
Reputation: 502
A few ideas:
Upvotes: 3