Reputation: 143
My Java app's log time suddenly jumped from 22:20 to 00:33 of the next day. Before 22:20 and after 00:33 the app ran correctly. It looks like the app was frozen between 22:20 and 00:33.
There is a thread pool in my app. I use a DelayQueue to handle timed-out business tasks (Runnables), so I submit a task to the thread pool that polls for timed-out business tasks and handles them:
delayFuture = pool.submit(() -> {
    logger.debug("delayed task execution thread: {}", Thread.currentThread().getName());
    while (true) {
        try {
            // Wait up to 5 seconds for a task whose delay has expired.
            AbstractTask task = delayQueue.poll(5, TimeUnit.SECONDS);
            if (task != null && !task.isDone()) {
                // Mark the task as overtime, interrupt the thread executing it,
                // and fire the timeout handling on another pool thread.
                task.setOvertime(true);
                task.getExecutingThread().interrupt();
                pool.submit(() -> {
                    task.fireTimeout();
                });
            }
        } catch (Exception e) {
            logger.error("delay task error", e);
        }
    }
});
All business tasks do similar work with a fixed delay. When a business task executes, it finishes several steps one by one on a single thread of the thread pool. But the log showed that some tasks were blocked in the same step, a call to
LinkedBlockingQueue.poll(timeout, unit);
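For context, here is a minimal sketch of what such a task might look like, so that it can both live in the DelayQueue and block on a LinkedBlockingQueue inside a step. The names (deadlineMillis, responseQueue) and the 4-minute budget are assumptions for illustration, not the real AbstractTask:

import java.util.concurrent.Delayed;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public abstract class AbstractTask implements Delayed, Runnable {
    // Assumed 4-minute budget per task; the real deadline logic is not shown in the question.
    private final long deadlineMillis = System.currentTimeMillis() + 4 * 60 * 1000;
    private volatile boolean done;
    private volatile boolean overtime;
    private volatile Thread executingThread;
    // Each step hands its result over via a blocking queue and waits for it with a timeout.
    protected final LinkedBlockingQueue<Object> responseQueue = new LinkedBlockingQueue<>();

    @Override
    public long getDelay(TimeUnit unit) {
        return unit.convert(deadlineMillis - System.currentTimeMillis(), TimeUnit.MILLISECONDS);
    }

    @Override
    public int compareTo(Delayed other) {
        return Long.compare(getDelay(TimeUnit.MILLISECONDS), other.getDelay(TimeUnit.MILLISECONDS));
    }

    @Override
    public void run() {
        executingThread = Thread.currentThread();   // so the watcher loop above can interrupt this thread
        try {
            // One "step": block until a response arrives or the step times out.
            Object response = responseQueue.poll(30, TimeUnit.SECONDS);
            // ... handle the response and run the next step ...
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();      // interrupted by the watcher once the task is overtime
        } finally {
            done = true;
        }
    }

    public boolean isDone() { return done; }
    public void setOvertime(boolean overtime) { this.overtime = overtime; }
    public Thread getExecutingThread() { return executingThread; }
    public abstract void fireTimeout();
}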
It is really weird that the log time jumped from 22:20 to 00:33 and then everything ran normally again. Here is the relevant log excerpt:
2019-02-16 22:20:20,627 INFO (biz_thread_64:com.some.company.task.AbstractTask.interruptProcess:31) - device some_device sent control cmd response
2019-02-16 22:20:20,647 INFO (biz_thread_64:com.some.company.task.AbstractTask.interruptProcess:36) - user some_user received device some_device control cmd response
2019-02-16 22:20:25,809 INFO (biz_thread_64:com.some.company.task.AbstractTask.interruptProcess:38) - device some_device disconnected, cancel device heartbeat: true
// the next line is logged right after LinkedBlockingQueue.poll(timeout, unit) returns
2019-02-17 00:33:18,751 INFO (biz_thread_64:com.some.company.task.AbstractTask.interruptProcess:46) - user some_user received device some_device offline message
2019-02-17 00:33:18,747 INFO (biz_thread_56:com.some.company.task.TaskManager.lambda$null$0:54) - fire task timeout operation: user get device offline notification, devMac=some_device, userToken=some_user
2019-02-17 00:33:18,951 INFO (biz_thread_56:com.some.company.task.AbstractTask.resetState:174) - device some_device disconnected, cancel device heartbeat: false
2019-02-17 00:33:18,951 INFO (biz_thread_56:com.some.company.task.AbstractTask.resetState:183) - user some_user disconnected, cancel user heartbeat: true
2019-02-17 00:33:18,962 INFO (biz_thread_64:com.some.company.task.TaskManager.exec:170) - task is overtime: task=user get device offline notification, devMac=some_device, userToken=some_user
There is a jump between lines 3 and 4 of the log above. Normally the time interval between those two steps should be less than 4 minutes. It looks like the app was frozen between 22:20 and 00:33.
Some other useful info:
The thread pool is fixed-size;
The app runs in a VM, but another Java app on the same VM has logs between 22:20 and 00:33, so the VM did not sleep in that time range;
There is another fixed-delay task that prints the status of each business task, but it also did not print anything between 22:20 and 00:33;
There is an HTTP server based on Vert.x that provides RESTful services. While the problem occurred, the RESTful services were unresponsive, i.e. the browser hung when issuing an HTTP GET request to the server, but telnet to the server still worked.
Any suggestions will be appreciated.
Update (2019-02-26 18:13)
It happened again! My app froze at 2019-02-26 17:30:17,576 and recovered at 2019-02-26 19:40:29,655, i.e. it was frozen for about 2 hours, 10 minutes and 12 seconds. The last occurrence of this problem lasted 2 hours, 12 minutes and 53 seconds.
When the problem occurred, both the jstack and jmap commands failed to execute:
[cloud-user@extdev-platform-ug bin]$ ~/jdk1.8.0_192/bin/jstack -F 10914
Attaching to process ID 10914, please wait...
Error attaching to process: sun.jvm.hotspot.debugger.DebuggerException: Can't attach to the process: ptrace(PTRACE_ATTACH, ..) failed for 10914: Operation not permitted
sun.jvm.hotspot.debugger.DebuggerException: sun.jvm.hotspot.debugger.DebuggerException: Can't attach to the process: ptrace(PTRACE_ATTACH, ..) failed for 10914: Operation not permitted
at sun.jvm.hotspot.debugger.linux.LinuxDebuggerLocal$LinuxDebuggerLocalWorkerThread.execute(LinuxDebuggerLocal.java:163)
at sun.jvm.hotspot.debugger.linux.LinuxDebuggerLocal.attach(LinuxDebuggerLocal.java:278)
at sun.jvm.hotspot.HotSpotAgent.attachDebugger(HotSpotAgent.java:671)
at sun.jvm.hotspot.HotSpotAgent.setupDebuggerLinux(HotSpotAgent.java:611)
at sun.jvm.hotspot.HotSpotAgent.setupDebugger(HotSpotAgent.java:337)
at sun.jvm.hotspot.HotSpotAgent.go(HotSpotAgent.java:304)
at sun.jvm.hotspot.HotSpotAgent.attach(HotSpotAgent.java:140)
at sun.jvm.hotspot.tools.Tool.start(Tool.java:185)
at sun.jvm.hotspot.tools.Tool.execute(Tool.java:118)
at sun.jvm.hotspot.tools.JStack.main(JStack.java:92)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at sun.tools.jstack.JStack.runJStackTool(JStack.java:140)
at sun.tools.jstack.JStack.main(JStack.java:106)
Caused by: sun.jvm.hotspot.debugger.DebuggerException: Can't attach to the process: ptrace(PTRACE_ATTACH, ..) failed for 10914: Operation not permitted
at sun.jvm.hotspot.debugger.linux.LinuxDebuggerLocal.attach0(Native Method)
at sun.jvm.hotspot.debugger.linux.LinuxDebuggerLocal.access$100(LinuxDebuggerLocal.java:62)
at sun.jvm.hotspot.debugger.linux.LinuxDebuggerLocal$1AttachTask.doit(LinuxDebuggerLocal.java:269)
at sun.jvm.hotspot.debugger.linux.LinuxDebuggerLocal$LinuxDebuggerLocalWorkerThread.run(LinuxDebuggerLocal.java:138)
[cloud-user@extdev-platform-ug bin]$ ~/jdk1.8.0_192/bin/jmap -heap 10914
Attaching to process ID 10914, please wait...
Error attaching to process: sun.jvm.hotspot.debugger.DebuggerException: Can't attach to the process: ptrace(PTRACE_ATTACH, ..) failed for 10914: Operation not permitted
sun.jvm.hotspot.debugger.DebuggerException: sun.jvm.hotspot.debugger.DebuggerException: Can't attach to the process: ptrace(PTRACE_ATTACH, ..) failed for 10914: Operation not permitted
at sun.jvm.hotspot.debugger.linux.LinuxDebuggerLocal$LinuxDebuggerLocalWorkerThread.execute(LinuxDebuggerLocal.java:163)
at sun.jvm.hotspot.debugger.linux.LinuxDebuggerLocal.attach(LinuxDebuggerLocal.java:278)
at sun.jvm.hotspot.HotSpotAgent.attachDebugger(HotSpotAgent.java:671)
at sun.jvm.hotspot.HotSpotAgent.setupDebuggerLinux(HotSpotAgent.java:611)
at sun.jvm.hotspot.HotSpotAgent.setupDebugger(HotSpotAgent.java:337)
at sun.jvm.hotspot.HotSpotAgent.go(HotSpotAgent.java:304)
at sun.jvm.hotspot.HotSpotAgent.attach(HotSpotAgent.java:140)
at sun.jvm.hotspot.tools.Tool.start(Tool.java:185)
at sun.jvm.hotspot.tools.Tool.execute(Tool.java:118)
at sun.jvm.hotspot.tools.HeapSummary.main(HeapSummary.java:49)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at sun.tools.jmap.JMap.runTool(JMap.java:201)
at sun.tools.jmap.JMap.main(JMap.java:130)
Caused by: sun.jvm.hotspot.debugger.DebuggerException: Can't attach to the process: ptrace(PTRACE_ATTACH, ..) failed for 10914: Operation not permitted
at sun.jvm.hotspot.debugger.linux.LinuxDebuggerLocal.attach0(Native Method)
at sun.jvm.hotspot.debugger.linux.LinuxDebuggerLocal.access$100(LinuxDebuggerLocal.java:62)
at sun.jvm.hotspot.debugger.linux.LinuxDebuggerLocal$1AttachTask.doit(LinuxDebuggerLocal.java:269)
at sun.jvm.hotspot.debugger.linux.LinuxDebuggerLocal$LinuxDebuggerLocalWorkerThread.run(LinuxDebuggerLocal.java:138)
The last GC log entry before the problem:
[GC pause (G1 Evacuation Pause) (young), 0.0388874 secs]
[Parallel Time: 26.9 ms, GC Workers: 2]
[GC Worker Start (ms): Min: 22597871.6, Avg: 22597871.6, Max: 22597871.7, Diff: 0.0]
[Ext Root Scanning (ms): Min: 3.8, Avg: 3.9, Max: 4.0, Diff: 0.2, Sum: 7.9]
[Update RS (ms): Min: 9.8, Avg: 9.8, Max: 9.8, Diff: 0.1, Sum: 19.6]
[Processed Buffers: Min: 245, Avg: 266.0, Max: 287, Diff: 42, Sum: 532]
[Scan RS (ms): Min: 2.3, Avg: 2.3, Max: 2.4, Diff: 0.2, Sum: 4.7]
[Code Root Scanning (ms): Min: 0.2, Avg: 0.7, Max: 1.2, Diff: 1.0, Sum: 1.4]
[Object Copy (ms): Min: 9.4, Avg: 9.9, Max: 10.5, Diff: 1.1, Sum: 19.8]
[Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
[Termination Attempts: Min: 1, Avg: 1.0, Max: 1, Diff: 0, Sum: 2]
[GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.0, Sum: 0.1]
[GC Worker Total (ms): Min: 26.7, Avg: 26.7, Max: 26.7, Diff: 0.0, Sum: 53.5]
[GC Worker End (ms): Min: 22597898.4, Avg: 22597898.4, Max: 22597898.4, Diff: 0.0]
[Code Root Fixup: 0.0 ms]
[Code Root Purge: 0.0 ms]
[Clear CT: 0.9 ms]
[Other: 11.1 ms]
[Choose CSet: 0.0 ms]
[Ref Proc: 8.4 ms]
[Ref Enq: 0.1 ms]
[Redirty Cards: 0.2 ms]
[Humongous Register: 0.0 ms]
[Humongous Reclaim: 0.0 ms]
[Free CSet: 0.7 ms]
[Eden: 604.0M(604.0M)->0.0B(604.0M) Survivors: 10.0M->10.0M Heap: 707.4M(1024.0M)->103.4M(1024.0M)]
[Times: user=0.06 sys=0.00, real=0.04 secs]
My app runs with these options:
-Xms1g -Xmx1g -server -XX:+PrintGCDetails -verbose:gc -XX:+UseG1GC -XX:MaxGCPauseMillis=100 -XX:+HeapDumpOnOutOfMemoryError
The top command gives:
top -H -p 10914
Tasks: 545 total, 0 running, 0 sleeping, 545 stopped, 0 zombie
Cpu0 : 0.7%us, 1.0%sy, 0.0%ni, 97.3%id, 0.0%wa, 0.0%hi, 0.0%si, 1.0%st
Cpu1 : 4.0%us, 0.7%sy, 0.0%ni, 95.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 5991196k total, 5858276k used, 132920k free, 225448k buffers
Swap: 6291448k total, 132044k used, 6159404k free, 3820256k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
10914 cloud-us 20 0 4151m 1.0g 15m T 0.0 17.3 0:00.10 java
10915 cloud-us 20 0 4151m 1.0g 15m T 0.0 17.3 0:03.17 java
10916 cloud-us 20 0 4151m 1.0g 15m T 0.0 17.3 0:00.68 java
10917 cloud-us 20 0 4151m 1.0g 15m T 0.0 17.3 0:00.71 java
10918 cloud-us 20 0 4151m 1.0g 15m T 0.0 17.3 0:20.87 java
10919 cloud-us 20 0 4151m 1.0g 15m T 0.0 17.3 0:00.07 java
10920 cloud-us 20 0 4151m 1.0g 15m T 0.0 17.3 0:00.13 java
10921 cloud-us 20 0 4151m 1.0g 15m T 0.0 17.3 0:00.05 java
10922 cloud-us 20 0 4151m 1.0g 15m T 0.0 17.3 0:00.16 java
The netstat command showed 16 TCP connections in CLOSE_WAIT or ESTABLISHED state, and based on the log, 16 is also the number of chained tasks that did not finish normally.
My app periodically logs the status of the threads in the business thread pool (a sketch of how such a status printer might be wired up follows this list). The last snapshot before the problem:
"biz_thread_0":"TIMED_WAITING",
"biz_thread_1":"TIMED_WAITING",
"biz_thread_2":"TIMED_WAITING",
"biz_thread_3":"TIMED_WAITING",
"biz_thread_4":"TIMED_WAITING",
"biz_thread_5":"TIMED_WAITING",
"biz_thread_6":"WAITING",
"biz_thread_7":"TIMED_WAITING",
"biz_thread_8":"TIMED_WAITING",
"biz_thread_9":"TIMED_WAITING",
"biz_thread_10":"WAITING",
"biz_thread_11":"TIMED_WAITING",
"biz_thread_12":"WAITING",
"biz_thread_13":"TIMED_WAITING",
"biz_thread_14":"TIMED_WAITING",
"biz_thread_15":"TIMED_WAITING",
"biz_thread_16":"TIMED_WAITING",
"biz_thread_17":"TIMED_WAITING",
"biz_thread_18":"TIMED_WAITING",
"biz_thread_19":"TIMED_WAITING",
"biz_thread_20":"TIMED_WAITING",
"biz_thread_21":"TIMED_WAITING",
"biz_thread_22":"TIMED_WAITING",
"biz_thread_23":"TIMED_WAITING",
"biz_thread_24":"TIMED_WAITING",
"biz_thread_25":"TIMED_WAITING",
"biz_thread_26":"WAITING",
"biz_thread_27":"WAITING",
"biz_thread_28":"WAITING",
"biz_thread_29":"WAITING",
"biz_thread_30":"WAITING",
"biz_thread_31":"TIMED_WAITING",
"biz_thread_32":"TIMED_WAITING",
"biz_thread_33":"TIMED_WAITING",
"biz_thread_34":"RUNNABLE",
"biz_thread_35":"TIMED_WAITING",
"biz_thread_36":"TIMED_WAITING",
"biz_thread_37":"WAITING",
"biz_thread_38":"TIMED_WAITING",
"biz_thread_39":"TIMED_WAITING",
"biz_thread_40":"TIMED_WAITING",
"biz_thread_41":"TIMED_WAITING",
"biz_thread_42":"TIMED_WAITING",
"biz_thread_43":"TIMED_WAITING",
"biz_thread_44":"WAITING",
"biz_thread_45":"WAITING",
"biz_thread_46":"TIMED_WAITING",
"biz_thread_47":"TIMED_WAITING",
"biz_thread_48":"TIMED_WAITING",
"biz_thread_49":"TIMED_WAITING",
"biz_thread_50":"TIMED_WAITING",
"biz_thread_51":"WAITING",
"biz_thread_52":"TIMED_WAITING",
"biz_thread_53":"TIMED_WAITING",
"biz_thread_54":"TIMED_WAITING",
"biz_thread_55":"TIMED_WAITING",
"biz_thread_56":"WAITING",
"biz_thread_57":"TIMED_WAITING",
"biz_thread_58":"TIMED_WAITING",
"biz_thread_59":"TIMED_WAITING",
"biz_thread_60":"TIMED_WAITING",
"biz_thread_61":"TIMED_WAITING",
"biz_thread_62":"TIMED_WAITING",
"biz_thread_63":"TIMED_WAITING",
"biz_thread_64":"TIMED_WAITING",
"biz_thread_65":"WAITING",
"biz_thread_66":"TIMED_WAITING",
"biz_thread_67":"WAITING",
"biz_thread_68":"WAITING",
"biz_thread_69":"TIMED_WAITING",
"biz_thread_70":"TIMED_WAITING",
"biz_thread_71":"TIMED_WAITING",
"biz_thread_72":"WAITING",
"biz_thread_73":"TIMED_WAITING",
"biz_thread_74":"TIMED_WAITING",
"biz_thread_75":"TIMED_WAITING",
"biz_thread_76":"WAITING",
"biz_thread_77":"TIMED_WAITING",
"biz_thread_78":"TIMED_WAITING",
"biz_thread_79":"TIMED_WAITING",
"biz_thread_80":"TIMED_WAITING",
"biz_thread_81":"TIMED_WAITING",
"biz_thread_82":"WAITING",
"biz_thread_83":"TIMED_WAITING",
"biz_thread_84":"TIMED_WAITING",
"biz_thread_85":"TIMED_WAITING",
"biz_thread_86":"TIMED_WAITING",
"biz_thread_87":"TIMED_WAITING",
"biz_thread_88":"TIMED_WAITING",
"biz_thread_89":"WAITING",
"biz_thread_90":"TIMED_WAITING",
"biz_thread_91":"TIMED_WAITING",
"biz_thread_92":"TIMED_WAITING",
"biz_thread_93":"TIMED_WAITING",
"biz_thread_94":"WAITING",
"biz_thread_95":"TIMED_WAITING",
"biz_thread_96":"TIMED_WAITING",
"biz_thread_97":"TIMED_WAITING",
"biz_thread_98":"TIMED_WAITING",
"biz_thread_99":"TIMED_WAITING",
"biz_thread_100":"TIMED_WAITING",
"biz_thread_101":"TIMED_WAITING",
"biz_thread_102":"TIMED_WAITING",
"biz_thread_103":"WAITING",
"biz_thread_114":"WAITING",
"biz_thread_113":"TIMED_WAITING",
"biz_thread_112":"WAITING",
"biz_thread_111":"TIMED_WAITING",
"biz_thread_110":"TIMED_WAITING",
"biz_thread_109":"TIMED_WAITING",
"biz_thread_108":"TIMED_WAITING",
"biz_thread_107":"TIMED_WAITING",
"biz_thread_106":"TIMED_WAITING",
"biz_thread_105":"WAITING",
"biz_thread_104":"TIMED_WAITING",
"biz_thread_115":"TIMED_WAITING",
"biz_thread_116":"TIMED_WAITING",
"biz_thread_122":"TIMED_WAITING",
"biz_thread_121":"TIMED_WAITING",
"biz_thread_120":"TIMED_WAITING",
"biz_thread_119":"TIMED_WAITING",
"biz_thread_118":"TIMED_WAITING",
"biz_thread_117":"TIMED_WAITING",
"biz_thread_126":"TIMED_WAITING",
"biz_thread_125":"TIMED_WAITING",
"biz_thread_124":"WAITING",
"biz_thread_123":"WAITING",
"biz_thread_127":"WAITING"
Upvotes: 2
Views: 2507
Reputation: 143
Finally!
I do not know which Java option or other change helped, but this time there was a full thread dump in the file that nohup's stdout was redirected to (there was no thread dump the first time the problem occurred).
2019-02-26 19:40:29
Full thread dump Java HotSpot(TM) 64-Bit Server VM (25.192-b12 mixed mode):
"biz_thread_87" #104 prio=5 os_prio=0 tid=0x00007f93e76fd800 nid=0x2b24 waiting for monitor entry [0x00007f938bfbe000]
java.lang.Thread.State: BLOCKED (on object monitor)
at org.apache.logging.log4j.core.appender.rolling.RollingFileManager.checkRollover(RollingFileManager.java:217)
- waiting to lock <0x00000000c06725d8> (a org.apache.logging.log4j.core.appender.rolling.RollingRandomAccessFileManager)
at org.apache.logging.log4j.core.appender.RollingRandomAccessFileAppender.append(RollingRandomAccessFileAppender.java:207)
at org.apache.logging.log4j.core.config.AppenderControl.tryCallAppender(AppenderControl.java:156)
at org.apache.logging.log4j.core.config.AppenderControl.callAppender0(AppenderControl.java:129)
at org.apache.logging.log4j.core.config.AppenderControl.callAppenderPreventRecursion(AppenderControl.java:120)
at org.apache.logging.log4j.core.config.AppenderControl.callAppender(AppenderControl.java:84)
at org.apache.logging.log4j.core.config.LoggerConfig.callAppenders(LoggerConfig.java:448)
at org.apache.logging.log4j.core.config.LoggerConfig.processLogEvent(LoggerConfig.java:433)
at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:417)
at org.apache.logging.log4j.core.config.LoggerConfig.logParent(LoggerConfig.java:439)
at org.apache.logging.log4j.core.config.LoggerConfig.processLogEvent(LoggerConfig.java:434)
at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:417)
at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:403)
at org.apache.logging.log4j.core.config.AwaitCompletionReliabilityStrategy.log(AwaitCompletionReliabilityStrategy.java:63)
at org.apache.logging.log4j.core.Logger.logMessage(Logger.java:146)
at org.apache.logging.log4j.spi.AbstractLogger.logMessageSafely(AbstractLogger.java:2091)
at org.apache.logging.log4j.spi.AbstractLogger.logMessage(AbstractLogger.java:1999)
at org.apache.logging.log4j.spi.AbstractLogger.logIfEnabled(AbstractLogger.java:1868)
at org.apache.logging.slf4j.Log4jLogger.info(Log4jLogger.java:194)
at com.some.company.task.impl.StartTask.interruptProcess(StartTask.java:17)
at com.some.company.common.task.AbstractTask.process(AbstractTask.java:88)
at com.some.company.common.task.TaskManager.exec(TaskManager.java:174)
at com.some.company.common.task.Context.lambda$resubmit$0(Context.java:426)
at com.some.company.common.task.Context$$Lambda$97/519533636.run(Unknown Source)
at com.some.company.common.task.TaskManager.lambda$delayExec$2(TaskManager.java:121)
at com.some.company.common.task.TaskManager$$Lambda$37/873634936.run(Unknown Source)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
"nioEventLoopGroup-173-1" #315 prio=10 os_prio=0 tid=0x00007f939420e000 nid=0x2ced runnable [0x00007f937b3e8000]
java.lang.Thread.State: RUNNABLE
at java.io.RandomAccessFile.writeBytes(Native Method)
at java.io.RandomAccessFile.write(RandomAccessFile.java:525)
at org.apache.logging.log4j.core.appender.rolling.RollingRandomAccessFileManager.writeToDestination(RollingRandomAccessFileManager.java:109)
- eliminated <0x00000000c06725d8> (a org.apache.logging.log4j.core.appender.rolling.RollingRandomAccessFileManager)
at org.apache.logging.log4j.core.appender.OutputStreamManager.flushBuffer(OutputStreamManager.java:288)
- eliminated <0x00000000c06725d8> (a org.apache.logging.log4j.core.appender.rolling.RollingRandomAccessFileManager)
at org.apache.logging.log4j.core.appender.rolling.RollingRandomAccessFileManager.flush(RollingRandomAccessFileManager.java:128)
- locked <0x00000000c06725d8> (a org.apache.logging.log4j.core.appender.rolling.RollingRandomAccessFileManager)
at org.apache.logging.log4j.core.appender.AbstractOutputStreamAppender.directEncodeEvent(AbstractOutputStreamAppender.java:179)
at org.apache.logging.log4j.core.appender.AbstractOutputStreamAppender.tryAppend(AbstractOutputStreamAppender.java:170)
at org.apache.logging.log4j.core.appender.AbstractOutputStreamAppender.append(AbstractOutputStreamAppender.java:161)
at org.apache.logging.log4j.core.appender.RollingRandomAccessFileAppender.append(RollingRandomAccessFileAppender.java:218)
at org.apache.logging.log4j.core.config.AppenderControl.tryCallAppender(AppenderControl.java:156)
at org.apache.logging.log4j.core.config.AppenderControl.callAppender0(AppenderControl.java:129)
at org.apache.logging.log4j.core.config.AppenderControl.callAppenderPreventRecursion(AppenderControl.java:120)
at org.apache.logging.log4j.core.config.AppenderControl.callAppender(AppenderControl.java:84)
at org.apache.logging.log4j.core.config.LoggerConfig.callAppenders(LoggerConfig.java:448)
at org.apache.logging.log4j.core.config.LoggerConfig.processLogEvent(LoggerConfig.java:433)
at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:417)
at org.apache.logging.log4j.core.config.LoggerConfig.logParent(LoggerConfig.java:439)
at org.apache.logging.log4j.core.config.LoggerConfig.processLogEvent(LoggerConfig.java:434)
at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:417)
at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:403)
at org.apache.logging.log4j.core.config.AwaitCompletionReliabilityStrategy.log(AwaitCompletionReliabilityStrategy.java:63)
at org.apache.logging.log4j.core.Logger.logMessage(Logger.java:146)
at org.apache.logging.log4j.spi.AbstractLogger.logMessageSafely(AbstractLogger.java:2091)
at org.apache.logging.log4j.spi.AbstractLogger.logMessage(AbstractLogger.java:2005)
at org.apache.logging.log4j.spi.AbstractLogger.logIfEnabled(AbstractLogger.java:1876)
at org.apache.logging.slf4j.Log4jLogger.info(Log4jLogger.java:184)
at com.some.company.UserWifiClient.sendHeartbeat(UserWifiEppClient.java:39)
at com.some.company.AbstractNettyClient.lambda$scheduleHeartbeat$0(AbstractNettyClient.java:146)
at com.some.company.AbstractNettyClient$$Lambda$73/2134853806.run(Unknown Source)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:133)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:404)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:495)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:905)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)
There are 101 BLOCKED threads and 94 threads "waiting to lock <0x00000000c06725d8>".
So I think it must be a misuse of the log4j Logger. In AbstractTask the logger was created like this:
protected final static Logger logger = LoggerFactory.getLogger(AbstractTask.class);
All AbstractTask implementations shared this logger, and a logger created this way introduced contention when multiple threads tried to use it. Now I create the logger this way in AbstractTask:
protected Logger logger = LoggerFactory.getLogger(getClass());
Now each concrete task class has its own logger object and should no longer face this contention problem.
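For clarity, here is a minimal sketch of the change in AbstractTask, assuming the SLF4J LoggerFactory shown in the snippets above; the StartTask subclass and its method body are hypothetical:

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public abstract class AbstractTask {
    // Before: a single logger shared by every subclass.
    // protected final static Logger logger = LoggerFactory.getLogger(AbstractTask.class);

    // After: each concrete task class gets a logger named after its own class.
    protected final Logger logger = LoggerFactory.getLogger(getClass());

    protected void interruptProcess() {
        logger.info("processing {}", getClass().getSimpleName());
    }
}

// Hypothetical subclass: its log events now go through a logger named after StartTask.
class StartTask extends AbstractTask {
    @Override
    protected void interruptProcess() {
        logger.info("start task interrupted");
    }
}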
See this: Should logger be private static or not
What a big lesson!
By the way, does the JVM ever dump threads to stdout automatically? I found the above "Full thread dump ..." in the file that nohup's stdout was redirected to.
Was this kind of blocked-thread problem a deadlock? The app recovered about 2 hours later; if it were a deadlock, how could it recover?
Update (2019-03-11)
Actually, after I modified the way the log4j logger is used, I ran into this problem again, even after a week of the app running smoothly. I now suspect it was caused by jmap. Last Friday I ran the command below for several hours:
while [ true ]; do jmap -heap 340; echo "=================================================="; sleep 30; done
Then at about 17:00 the app froze again! So if your app run correctly, run jmap
or jstack
against the app pid as fewer as possible! These commands will hang JVM. When they are executed too frequently, JVM will possibly get-frozen.
Upvotes: 4