T.Tony
T.Tony

Reputation: 515

Java hangs for several seconds, but no gc pause found in gc log

I have a java application running on Linux (CentOS 7) using g1 gc, and it regularly hangs for several seconds, it looks just like gc pauses, but I can't find such long pauses in gc log.

To make sure that the java application hangs, I started a background thread who dose nothing else but print out a log every 500ms. And the log was found paused for several seconds. Here is the log and it paused at [14:31:02,834] to [14:31:05,677]

WARN [2018-07-16 14:30:57,831][clock]py.datanode.DataNodeAppEngine(196):tick...
WARN [2018-07-16 14:30:58,331][clock]py.datanode.DataNodeAppEngine(196):tick...
WARN [2018-07-16 14:30:58,832][clock]py.datanode.DataNodeAppEngine(196):tick...
WARN [2018-07-16 14:30:59,332][clock]py.datanode.DataNodeAppEngine(196):tick...
WARN [2018-07-16 14:30:59,832][clock]py.datanode.DataNodeAppEngine(196):tick...
WARN [2018-07-16 14:31:00,333][clock]py.datanode.DataNodeAppEngine(196):tick...
WARN [2018-07-16 14:31:00,833][clock]py.datanode.DataNodeAppEngine(196):tick...
WARN [2018-07-16 14:31:01,333][clock]py.datanode.DataNodeAppEngine(196):tick...
WARN [2018-07-16 14:31:01,834][clock]py.datanode.DataNodeAppEngine(196):tick...
WARN [2018-07-16 14:31:02,334][clock]py.datanode.DataNodeAppEngine(196):tick...
WARN [2018-07-16 14:31:02,834][clock]py.datanode.DataNodeAppEngine(196):tick...
WARN [2018-07-16 14:31:05,677][clock]py.datanode.DataNodeAppEngine(196):tick...
WARN [2018-07-16 14:31:06,177][clock]py.datanode.DataNodeAppEngine(196):tick...
WARN [2018-07-16 14:31:06,678][clock]py.datanode.DataNodeAppEngine(196):tick...
WARN [2018-07-16 14:31:07,178][clock]py.datanode.DataNodeAppEngine(196):tick...

And the gc log at the same time (grep with Total time for which application threads were stopped) :

2018-07-16T14:30:58.327+0800: 2679.423: Total time for which application threads were stopped: 0.3750533 seconds, Stopping threads took: 0.0003430 seconds
2018-07-16T14:31:05.176+0800: 2686.272: Total time for which application threads were stopped: 0.5037637 seconds, Stopping threads took: 0.0004556 seconds
2018-07-16T14:31:06.978+0800: 2688.074: Total time for which application threads were stopped: 0.0060367 seconds, Stopping threads took: 0.0003190 seconds

Besides, this java process also have some threads running native codes, they are written in C, and are not influenced by jvm. Those threads were running well, I am quite sure about this is because one of these threads are heartbeat threads and the heartbeat time out is 800MS, but no heartbeat timeout was found during the pause.

I also monitored the cpu usage, and the cpu with 12 cores was up to 80% idle.

Memory usage is not too high neither, THP(transparent huge pages) and the swap memory were also disabled.

And I found one thing I can't understand:

There is always a concurrent-mark-start near the pause, and wherever a concurrent-mark-start occurs, there is also a pause.

2018-07-16T14:30:58.489+0800: 2679.586: [GC concurrent-mark-start]

I know the concurrent mark phase won't cause STW, but I can't believe this is just an coincidence cause I reproduced this for a lot of times and it's always act like this.

And here are the CPU usage and memory usage during one of the pauses from YourKit :

enter image description here enter image description here

Thanks to @jspcal's suggestion, I got the SafepointStatistics :

         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
2566.430: G1IncCollectionPause             [     745          0              0    ]      [     0     0  2705     3   474    ]  0

The G1IncCollectionPause took nearly 3 seconds to get to a safe point, while the spin and the block time are both 0

Upvotes: 3

Views: 1952

Answers (2)

T.Tony
T.Tony

Reputation: 515

Finally I found it's caused by a jdk's bug about concurrent-marking for large reference arrays:

Large reference arrays cause extremely long synchronization times

And my solution is to change the large reference array to parted Two-Dimension arrays

Upvotes: 3

jspcal
jspcal

Reputation: 51894

While GC is one source of VM pauses, safepoints (stop the world pauses) can be initiated by other operations like flushing the code cache, biased locking, certain debug operations, etc. Here is a list of safepoint operations. To troubleshoot these safepoints, use these options:

Safepoints:

-XX:+UnlockDiagnosticVMOptions
-XX:+PrintSafepointStatistics
-XX:PrintSafepointStatisticsCount=1
-XX:+SafepointTimeout
-XX:SafepointTimeoutDelay=500
-XX:+LogVMOutput
-XX:LogFile=/var/log/jvm/vm.log

GC:

-verbose:gc
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintGCApplicationConcurrentTime
-XX:+PrintGCDateStamps
-XX:+PrintGCDetails 
-Xloggc:/var/log/jvm/garbage.log
-XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=10
-XX:GCLogFileSize=100M

Upvotes: 3

Related Questions