bbrower

Reputation: 43

JVM performance degrades after random amount of time

In short, I've got a performance problem that "randomly" shows up in one JVM at a time, even though that JVM may have been running fine for days, and I can't seem to find the root cause. I'm leaning towards something eating up the thread pool, but I haven't been able to track that down.

I've run through just about everything I can think of to track this down; any suggestions would be great!

(I've got JProfiler, YourKit, and jvisualvm at my disposal; I've tried running with all of them and have run comparisons between JVMs.)

So here's the setup. We run 40 JVMs in a heavily used testing environment (10 per hardware machine). They use an open source product called UltraESB (2.3.0) that leverages thread pools for asynchronous request/reply processing, though in our case it does stateless, header-based routing of JMS messages. We have a less heavily used, but still commonly used, setup in our development environment, and we have never seen this problem there.

We see pretty frequent minor GCs (one every few minutes) and rarely see major GCs (once a day or so). We're using HotSpot Java 1.7.0_71 on CentOS 6.7 (the Haswell CPU bug is patched).

Occasionally (seemingly completely at random, as far as I can tell) one of the JVMs will start to perform poorly (we have monitors and metrics on application performance). In the normal case we process a message in <1 ms. Once we hit the error scenario we start to see processing times in the hundreds (100-200) of milliseconds. Over the course of a few weeks we'll see several poorly performing JVMs. A recycle cleans things up and they'll run well for another stretch of days. As JVMs degrade, they end up with almost exactly the same processing times as other instances that have encountered performance problems (including instances on other hardware). That isn't too strange, as they run the exact same code base and the JMS load is balanced round robin, so they process nearly identical numbers of messages.

I triggered this performance impact once by turning on CPU performance profiling. See the graph: the blue line was a healthy process until I turned on CPU tracing, at which point it started to perform poorly.

The interesting thing is that even after profiling was disabled, the poor performance continued.

Things I've tried measuring

Nothing I've tried has pointed me towards any silver bullets.

GC monitoring - GC duration and CPU utilization seem consistent between reference and poorly performing JVMs. (A rough sketch of how these numbers can be pulled via JMX follows the launch options below.)

GC launch options:

GC_OPTS="-XX:+PrintGCDetails \
-XX:+UseG1GC \
-XX:MaxGCPauseMillis=100 \
-XX:+ParallelRefProcEnabled \
-XX:+UnlockExperimentalVMOptions \
-XX:-ResizePLAB \
-XX:G1NewSizePercent=50 \
-XX:G1MaxNewSizePercent=50 \
-XX:+PrintAdaptiveSizePolicy \
-Xloggc:/logs/applogs/${instancename}/gc.${DATE}.log"
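For reference, here's a minimal sketch of the kind of in-process GC comparison I mean (the class name and polling interval are just illustrative; with G1 the collector beans show up as "G1 Young Generation" and "G1 Old Generation"):

// Illustrative sketch: poll cumulative GC counts and times via JMX so that
// a healthy and a degraded JVM can be diffed without attaching a profiler.
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcStatsDump {
    public static void main(String[] args) throws InterruptedException {
        while (true) {
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                System.out.printf("%s: count=%d totalTimeMs=%d%n",
                        gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
            }
            Thread.sleep(60_000); // sample once a minute
        }
    }
}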

CPU sampling - There are so many things going on inside the JVM that nothing sticks out as a difference to me. Turning this on has caused problems, though not always, depending on the sampling settings.
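A lower-impact check I could fall back on, instead of profiler-driven sampling, is snapshotting per-thread CPU time via ThreadMXBean and diffing a healthy instance against a degraded one; a rough sketch, with an illustrative class name:

// Illustrative sketch: dump per-thread CPU time and state via ThreadMXBean;
// taking two snapshots a minute apart and diffing them shows which threads
// are actually burning CPU, without a profiler attached.
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadCpuSnapshot {
    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        if (threads.isThreadCpuTimeSupported() && !threads.isThreadCpuTimeEnabled()) {
            threads.setThreadCpuTimeEnabled(true);
        }
        for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
            long cpuNanos = threads.getThreadCpuTime(info.getThreadId());
            System.out.printf("%-45s state=%-13s cpuMs=%d%n",
                    info.getThreadName(), info.getThreadState(), cpuNanos / 1_000_000);
        }
    }
}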

Thread pool usage - Stats are exported as an MBean, and the thread pool (Spring 3.2.4 ThreadPoolTaskExecutor) and the threads in use seem the same as in other well-performing instances.
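To be concrete about what those stats are: the numbers being compared are just the standard counters off the underlying ThreadPoolExecutor, roughly like this sketch (class and method names are illustrative, not our actual MBean):

// Illustrative sketch: Spring's ThreadPoolTaskExecutor delegates to a
// java.util.concurrent.ThreadPoolExecutor, whose counters describe the pool.
import java.util.concurrent.ThreadPoolExecutor;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

public class PoolStats {
    public static String describe(ThreadPoolTaskExecutor taskExecutor) {
        ThreadPoolExecutor pool = taskExecutor.getThreadPoolExecutor();
        return String.format("active=%d poolSize=%d largest=%d queued=%d completed=%d",
                pool.getActiveCount(), pool.getPoolSize(), pool.getLargestPoolSize(),
                pool.getQueue().size(), pool.getCompletedTaskCount());
    }
}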

Upvotes: 1

Views: 720

Answers (2)

bbrower

Reputation: 43

Our problem disappeared when we split the worker thread pool from the thread pool that the Spring DMLC listener containers were using. I still wasn't able to find a root cause, but the problem has been solved.
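In case it helps anyone else, the change was roughly shaped like the sketch below: give the DefaultMessageListenerContainer its own task executor so the consumer threads no longer share a pool with the worker threads that do the routing. (Bean names, pool sizes, and the destination name are illustrative, not our exact config.)

// Illustrative sketch of the split: one dedicated executor for the DMLC's
// consumer threads, a separate one for the worker threads doing the routing.
import javax.jms.ConnectionFactory;
import javax.jms.MessageListener;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.jms.listener.DefaultMessageListenerContainer;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

@Configuration
public class JmsConfig {

    @Bean
    public ThreadPoolTaskExecutor listenerExecutor() {
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(10);   // must cover concurrentConsumers below
        executor.setMaxPoolSize(10);
        executor.setThreadNamePrefix("jms-listener-");
        return executor;
    }

    @Bean
    public ThreadPoolTaskExecutor workerExecutor() {
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(20);
        executor.setMaxPoolSize(20);
        executor.setThreadNamePrefix("worker-");
        return executor;
    }

    @Bean
    public DefaultMessageListenerContainer listenerContainer(ConnectionFactory connectionFactory,
                                                             MessageListener messageListener) {
        DefaultMessageListenerContainer container = new DefaultMessageListenerContainer();
        container.setConnectionFactory(connectionFactory);
        container.setDestinationName("inbound.queue");   // illustrative destination
        container.setMessageListener(messageListener);
        container.setTaskExecutor(listenerExecutor());    // dedicated pool for the listener container
        container.setConcurrentConsumers(10);
        return container;
    }
}

The listener then hands its routing work off to workerExecutor rather than doing it on the consumer thread, so neither pool can starve the other.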

Upvotes: 1

Philipp Lengauer

Reputation: 1959

You may try http://mevss.jku.at/AntTracks. It's a research JVM that records your memory behavior. It can then display heap properties over time and also visualize the heap at any point in time, offline, based on the trace. This VM is built to have as little impact as possible, so it does not distort the application's behavior the way badly configured sampling can. Of course, this only helps if you expect memory / GC to play a role in your problem.

Upvotes: 1
