Allen Xu

Reputation: 143

Weird problem with Java threads getting frozen

My Java app's log timestamps suddenly jumped from 22:20 to 00:33 of the next day. Before 22:20 and after 00:33 the app ran correctly. It looks like the app was frozen between 22:20 and 00:33.

There is a thread pool in my app. I use a DelayQueue to handle the timeout of business tasks (Runnables), so I submit a task to the thread pool that polls for timed-out business tasks and handles them:

delayFuture = pool.submit(() -> {
    logger.debug("delayed task execution thread: {}", Thread.currentThread().getName());
    while (true) {
        try {
            // wait up to 5 seconds for the next expired task
            AbstractTask task = delayQueue.poll(5, TimeUnit.SECONDS);
            if (task != null && !task.isDone()) {
                // mark the task as timed out and interrupt the thread currently executing it
                task.setOvertime(true);
                task.getExecutingThread().interrupt();

                // handle the timeout on another pool thread
                pool.submit(() -> {
                    task.fireTimeout();
                });
            }
        } catch (Exception e) {
            logger.error("delay task error", e);
        }
    }
});
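
For reference, the parts of AbstractTask that the snippet relies on look roughly like this. This is a simplified sketch, not the real class; since the tasks live in a DelayQueue, the class must also implement Delayed:

import java.util.concurrent.Delayed;

// Simplified sketch: only the members referenced by the snippet above are shown;
// the getDelay/compareTo methods required by Delayed are omitted here.
public abstract class AbstractTask implements Delayed {
    private volatile boolean done;
    private volatile boolean overtime;
    private volatile Thread executingThread;

    public boolean isDone() { return done; }
    public void setOvertime(boolean overtime) { this.overtime = overtime; }
    public Thread getExecutingThread() { return executingThread; }

    // called on a separate pool thread when the watcher detects a timeout
    public abstract void fireTimeout();
}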

All business tasks do similar work with a fixed delay. When a business task executes, it finishes several steps one by one on a single thread of the thread pool. But the log showed that some tasks "blocked" in the same step while doing

LinkedBlockingQueue.poll(timeout, unit);

It's really weird that the log time jumped from 22:20 to 00:33 and then returned to normal. Since poll() takes a bounded timeout, it should return within seconds even when nothing arrives, so a two-hour gap at that call suggests the thread was not being scheduled at all during that period.

2019-02-16 22:20:20,627 INFO  (biz_thread_64:com.some.company.task.AbstractTask.interruptProcess:31) - device some_device sent control cmd response
2019-02-16 22:20:20,647 INFO  (biz_thread_64:com.some.company.task.AbstractTask.interruptProcess:36) - user some_user received device some_device control cmd response

2019-02-16 22:20:25,809 INFO  (biz_thread_64:com.some.company.task.AbstractTask.interruptProcess:38) - device some_device disconnected, cancel device heartbeat: true
// the next line will be logged after LinkedBlockingQueue.poll(timeout, unit);
2019-02-17 00:33:18,751 INFO  (biz_thread_64:com.some.company.task.AbstractTask.interruptProcess:46) - user some_user received device some_device offline message

2019-02-17 00:33:18,747 INFO  (biz_thread_56:com.some.company.task.TaskManager.lambda$null$0:54) - fire task timeout operation: user get device offline notification, devMac=some_device, userToken=some_user
2019-02-17 00:33:18,951 INFO  (biz_thread_56:com.some.company.task.AbstractTask.resetState:174) - device some_device disconnected, cancel device heartbeat: false
2019-02-17 00:33:18,951 INFO  (biz_thread_56:com.some.company.task.AbstractTask.resetState:183) - user some_user disconnected, cancel user heartbeat: true

2019-02-17 00:33:18,962 INFO  (biz_thread_64:com.some.company.task.TaskManager.exec:170) - task is overtime: task=user get device offline notification, devMac=some_device, userToken=some_user

There is a jump between the third and fourth log lines above (22:20:25 and 00:33:18), but normally the interval between those two steps should be less than 4 minutes. Again, it looks like the app was frozen between 22:20 and 00:33.

Some other useful info:

  1. The thread pool is fixed-size (a sketch of how such a pool is typically built follows this list);

  2. The app runs in a VM, but another Java app on the same VM wrote logs between 22:20 and 00:33, so the VM was not suspended during that time range;

  3. There is another fixed-delay task that prints the status of each business task, but it did not print anything between 22:20 and 00:33 either;

  4. There is an HTTP server based on Vert.x that provides RESTful services. The RESTful services were unresponsive while the problem occurred, i.e. the browser hung on an HTTP GET request to the server, but telnet to the HTTP server port still worked.
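
For item 1, the pool is a plain fixed-size executor whose workers are named biz_thread_N. A sketch of how such a pool is typically built follows; the size of 128 is inferred from the biz_thread_0 ... biz_thread_127 names in the status log further down, and the ThreadFactory is an assumption, not the real code:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

public class BizPool {
    public static ExecutorService create() {
        ThreadFactory factory = new ThreadFactory() {
            private final AtomicInteger counter = new AtomicInteger();
            @Override
            public Thread newThread(Runnable r) {
                // produces the biz_thread_0, biz_thread_1, ... names seen in the logs
                return new Thread(r, "biz_thread_" + counter.getAndIncrement());
            }
        };
        return Executors.newFixedThreadPool(128, factory);
    }
}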

Any suggestions will be appreciated.

Update (2019-02-26 18:13)

It happened again! My app froze at 2019-02-26 17:30:17,576 and recovered at 2019-02-26 19:40:29,655, i.e. for about 2 hours, 10 minutes and 12 seconds. The previous occurrence of this problem lasted 2 hours, 12 minutes and 53 seconds.

When the problem occurred, both the jstack and jmap commands failed to attach to the process:

[cloud-user@extdev-platform-ug bin]$ ~/jdk1.8.0_192/bin/jstack -F 10914
Attaching to process ID 10914, please wait...
Error attaching to process: sun.jvm.hotspot.debugger.DebuggerException: Can't attach to the process: ptrace(PTRACE_ATTACH, ..) failed for 10914: Operation not permitted
sun.jvm.hotspot.debugger.DebuggerException: sun.jvm.hotspot.debugger.DebuggerException: Can't attach to the process: ptrace(PTRACE_ATTACH, ..) failed for 10914: Operation not permitted
    at sun.jvm.hotspot.debugger.linux.LinuxDebuggerLocal$LinuxDebuggerLocalWorkerThread.execute(LinuxDebuggerLocal.java:163)
    at sun.jvm.hotspot.debugger.linux.LinuxDebuggerLocal.attach(LinuxDebuggerLocal.java:278)
    at sun.jvm.hotspot.HotSpotAgent.attachDebugger(HotSpotAgent.java:671)
    at sun.jvm.hotspot.HotSpotAgent.setupDebuggerLinux(HotSpotAgent.java:611)
    at sun.jvm.hotspot.HotSpotAgent.setupDebugger(HotSpotAgent.java:337)
    at sun.jvm.hotspot.HotSpotAgent.go(HotSpotAgent.java:304)
    at sun.jvm.hotspot.HotSpotAgent.attach(HotSpotAgent.java:140)
    at sun.jvm.hotspot.tools.Tool.start(Tool.java:185)
    at sun.jvm.hotspot.tools.Tool.execute(Tool.java:118)
    at sun.jvm.hotspot.tools.JStack.main(JStack.java:92)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at sun.tools.jstack.JStack.runJStackTool(JStack.java:140)
    at sun.tools.jstack.JStack.main(JStack.java:106)
Caused by: sun.jvm.hotspot.debugger.DebuggerException: Can't attach to the process: ptrace(PTRACE_ATTACH, ..) failed for 10914: Operation not permitted
    at sun.jvm.hotspot.debugger.linux.LinuxDebuggerLocal.attach0(Native Method)
    at sun.jvm.hotspot.debugger.linux.LinuxDebuggerLocal.access$100(LinuxDebuggerLocal.java:62)
    at sun.jvm.hotspot.debugger.linux.LinuxDebuggerLocal$1AttachTask.doit(LinuxDebuggerLocal.java:269)
    at sun.jvm.hotspot.debugger.linux.LinuxDebuggerLocal$LinuxDebuggerLocalWorkerThread.run(LinuxDebuggerLocal.java:138)

[cloud-user@extdev-platform-ug bin]$ ~/jdk1.8.0_192/bin/jmap -heap 10914
Attaching to process ID 10914, please wait...
Error attaching to process: sun.jvm.hotspot.debugger.DebuggerException: Can't attach to the process: ptrace(PTRACE_ATTACH, ..) failed for 10914: Operation not permitted
sun.jvm.hotspot.debugger.DebuggerException: sun.jvm.hotspot.debugger.DebuggerException: Can't attach to the process: ptrace(PTRACE_ATTACH, ..) failed for 10914: Operation not permitted
    at sun.jvm.hotspot.debugger.linux.LinuxDebuggerLocal$LinuxDebuggerLocalWorkerThread.execute(LinuxDebuggerLocal.java:163)
    at sun.jvm.hotspot.debugger.linux.LinuxDebuggerLocal.attach(LinuxDebuggerLocal.java:278)
    at sun.jvm.hotspot.HotSpotAgent.attachDebugger(HotSpotAgent.java:671)
    at sun.jvm.hotspot.HotSpotAgent.setupDebuggerLinux(HotSpotAgent.java:611)
    at sun.jvm.hotspot.HotSpotAgent.setupDebugger(HotSpotAgent.java:337)
    at sun.jvm.hotspot.HotSpotAgent.go(HotSpotAgent.java:304)
    at sun.jvm.hotspot.HotSpotAgent.attach(HotSpotAgent.java:140)
    at sun.jvm.hotspot.tools.Tool.start(Tool.java:185)
    at sun.jvm.hotspot.tools.Tool.execute(Tool.java:118)
    at sun.jvm.hotspot.tools.HeapSummary.main(HeapSummary.java:49)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at sun.tools.jmap.JMap.runTool(JMap.java:201)
    at sun.tools.jmap.JMap.main(JMap.java:130)
Caused by: sun.jvm.hotspot.debugger.DebuggerException: Can't attach to the process: ptrace(PTRACE_ATTACH, ..) failed for 10914: Operation not permitted
    at sun.jvm.hotspot.debugger.linux.LinuxDebuggerLocal.attach0(Native Method)
    at sun.jvm.hotspot.debugger.linux.LinuxDebuggerLocal.access$100(LinuxDebuggerLocal.java:62)
    at sun.jvm.hotspot.debugger.linux.LinuxDebuggerLocal$1AttachTask.doit(LinuxDebuggerLocal.java:269)
    at sun.jvm.hotspot.debugger.linux.LinuxDebuggerLocal$LinuxDebuggerLocalWorkerThread.run(LinuxDebuggerLocal.java:138)

The last GC log entry before the problem:

[GC pause (G1 Evacuation Pause) (young), 0.0388874 secs]
   [Parallel Time: 26.9 ms, GC Workers: 2]
      [GC Worker Start (ms): Min: 22597871.6, Avg: 22597871.6, Max: 22597871.7, Diff: 0.0]
      [Ext Root Scanning (ms): Min: 3.8, Avg: 3.9, Max: 4.0, Diff: 0.2, Sum: 7.9]
      [Update RS (ms): Min: 9.8, Avg: 9.8, Max: 9.8, Diff: 0.1, Sum: 19.6]
         [Processed Buffers: Min: 245, Avg: 266.0, Max: 287, Diff: 42, Sum: 532]
      [Scan RS (ms): Min: 2.3, Avg: 2.3, Max: 2.4, Diff: 0.2, Sum: 4.7]
      [Code Root Scanning (ms): Min: 0.2, Avg: 0.7, Max: 1.2, Diff: 1.0, Sum: 1.4]
      [Object Copy (ms): Min: 9.4, Avg: 9.9, Max: 10.5, Diff: 1.1, Sum: 19.8]
      [Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
         [Termination Attempts: Min: 1, Avg: 1.0, Max: 1, Diff: 0, Sum: 2]
      [GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.0, Sum: 0.1]
      [GC Worker Total (ms): Min: 26.7, Avg: 26.7, Max: 26.7, Diff: 0.0, Sum: 53.5]
      [GC Worker End (ms): Min: 22597898.4, Avg: 22597898.4, Max: 22597898.4, Diff: 0.0]
   [Code Root Fixup: 0.0 ms]
   [Code Root Purge: 0.0 ms]
   [Clear CT: 0.9 ms]
   [Other: 11.1 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 8.4 ms]
      [Ref Enq: 0.1 ms]
      [Redirty Cards: 0.2 ms]
      [Humongous Register: 0.0 ms]
      [Humongous Reclaim: 0.0 ms]
      [Free CSet: 0.7 ms]
   [Eden: 604.0M(604.0M)->0.0B(604.0M) Survivors: 10.0M->10.0M Heap: 707.4M(1024.0M)->103.4M(1024.0M)]
 [Times: user=0.06 sys=0.00, real=0.04 secs] 

My app runs with these options:

-Xms1g -Xmx1g -server -XX:+PrintGCDetails -verbose:gc -XX:+UseG1GC -XX:MaxGCPauseMillis=100 -XX:+HeapDumpOnOutOfMemoryError

The top command gives:

top -H -p 10914

Tasks: 545 total,   0 running,   0 sleeping, 545 stopped,   0 zombie
Cpu0  :  0.7%us,  1.0%sy,  0.0%ni, 97.3%id,  0.0%wa,  0.0%hi,  0.0%si,  1.0%st
Cpu1  :  4.0%us,  0.7%sy,  0.0%ni, 95.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   5991196k total,  5858276k used,   132920k free,   225448k buffers
Swap:  6291448k total,   132044k used,  6159404k free,  3820256k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                 
10914 cloud-us  20   0 4151m 1.0g  15m T  0.0 17.3   0:00.10 java                                                                     
10915 cloud-us  20   0 4151m 1.0g  15m T  0.0 17.3   0:03.17 java                                                                     
10916 cloud-us  20   0 4151m 1.0g  15m T  0.0 17.3   0:00.68 java                                                                     
10917 cloud-us  20   0 4151m 1.0g  15m T  0.0 17.3   0:00.71 java                                                                     
10918 cloud-us  20   0 4151m 1.0g  15m T  0.0 17.3   0:20.87 java                                                                     
10919 cloud-us  20   0 4151m 1.0g  15m T  0.0 17.3   0:00.07 java                                                                     
10920 cloud-us  20   0 4151m 1.0g  15m T  0.0 17.3   0:00.13 java                                                                     
10921 cloud-us  20   0 4151m 1.0g  15m T  0.0 17.3   0:00.05 java                                                                     
10922 cloud-us  20   0 4151m 1.0g  15m T  0.0 17.3   0:00.16 java

The netstat command showed that there were 16 TCP connections in CLOSE_WAIT or ESTABLISHED state, and based on the log, 16 is also the number of chained tasks that did not finish normally.

My app periodically logs the status of the threads in the business thread pool. The last entry before the problem:

    "biz_thread_0":"TIMED_WAITING",
    "biz_thread_1":"TIMED_WAITING",
    "biz_thread_2":"TIMED_WAITING",
    "biz_thread_3":"TIMED_WAITING",
    "biz_thread_4":"TIMED_WAITING",
    "biz_thread_5":"TIMED_WAITING",
    "biz_thread_6":"WAITING",
    "biz_thread_7":"TIMED_WAITING",
    "biz_thread_8":"TIMED_WAITING",
    "biz_thread_9":"TIMED_WAITING",
    "biz_thread_10":"WAITING",
    "biz_thread_11":"TIMED_WAITING",
    "biz_thread_12":"WAITING",
    "biz_thread_13":"TIMED_WAITING",
    "biz_thread_14":"TIMED_WAITING",
    "biz_thread_15":"TIMED_WAITING",
    "biz_thread_16":"TIMED_WAITING",
    "biz_thread_17":"TIMED_WAITING",
    "biz_thread_18":"TIMED_WAITING",
    "biz_thread_19":"TIMED_WAITING",
    "biz_thread_20":"TIMED_WAITING",
    "biz_thread_21":"TIMED_WAITING",
    "biz_thread_22":"TIMED_WAITING",
    "biz_thread_23":"TIMED_WAITING",
    "biz_thread_24":"TIMED_WAITING",
    "biz_thread_25":"TIMED_WAITING",
    "biz_thread_26":"WAITING",
    "biz_thread_27":"WAITING",
    "biz_thread_28":"WAITING",
    "biz_thread_29":"WAITING",
    "biz_thread_30":"WAITING",
    "biz_thread_31":"TIMED_WAITING",
    "biz_thread_32":"TIMED_WAITING",
    "biz_thread_33":"TIMED_WAITING",
    "biz_thread_34":"RUNNABLE",
    "biz_thread_35":"TIMED_WAITING",
    "biz_thread_36":"TIMED_WAITING",
    "biz_thread_37":"WAITING",
    "biz_thread_38":"TIMED_WAITING",
    "biz_thread_39":"TIMED_WAITING",
    "biz_thread_40":"TIMED_WAITING",
    "biz_thread_41":"TIMED_WAITING",
    "biz_thread_42":"TIMED_WAITING",
    "biz_thread_43":"TIMED_WAITING",
    "biz_thread_44":"WAITING",
    "biz_thread_45":"WAITING",
    "biz_thread_46":"TIMED_WAITING",
    "biz_thread_47":"TIMED_WAITING",
    "biz_thread_48":"TIMED_WAITING",
    "biz_thread_49":"TIMED_WAITING",
    "biz_thread_50":"TIMED_WAITING",
    "biz_thread_51":"WAITING",
    "biz_thread_52":"TIMED_WAITING",
    "biz_thread_53":"TIMED_WAITING",
    "biz_thread_54":"TIMED_WAITING",
    "biz_thread_55":"TIMED_WAITING",
    "biz_thread_56":"WAITING",
    "biz_thread_57":"TIMED_WAITING",
    "biz_thread_58":"TIMED_WAITING",
    "biz_thread_59":"TIMED_WAITING",
    "biz_thread_60":"TIMED_WAITING",
    "biz_thread_61":"TIMED_WAITING",
    "biz_thread_62":"TIMED_WAITING",
    "biz_thread_63":"TIMED_WAITING",
    "biz_thread_64":"TIMED_WAITING",
    "biz_thread_65":"WAITING",
    "biz_thread_66":"TIMED_WAITING",
    "biz_thread_67":"WAITING",
    "biz_thread_68":"WAITING",
    "biz_thread_69":"TIMED_WAITING",
    "biz_thread_70":"TIMED_WAITING",
    "biz_thread_71":"TIMED_WAITING",
    "biz_thread_72":"WAITING",
    "biz_thread_73":"TIMED_WAITING",
    "biz_thread_74":"TIMED_WAITING",
    "biz_thread_75":"TIMED_WAITING",
    "biz_thread_76":"WAITING",
    "biz_thread_77":"TIMED_WAITING",
    "biz_thread_78":"TIMED_WAITING",
    "biz_thread_79":"TIMED_WAITING",
    "biz_thread_80":"TIMED_WAITING",
    "biz_thread_81":"TIMED_WAITING",
    "biz_thread_82":"WAITING",
    "biz_thread_83":"TIMED_WAITING",
    "biz_thread_84":"TIMED_WAITING",
    "biz_thread_85":"TIMED_WAITING",
    "biz_thread_86":"TIMED_WAITING",
    "biz_thread_87":"TIMED_WAITING",
    "biz_thread_88":"TIMED_WAITING",
    "biz_thread_89":"WAITING",
    "biz_thread_90":"TIMED_WAITING",
    "biz_thread_91":"TIMED_WAITING",
    "biz_thread_92":"TIMED_WAITING",
    "biz_thread_93":"TIMED_WAITING",
    "biz_thread_94":"WAITING",
    "biz_thread_95":"TIMED_WAITING",
    "biz_thread_96":"TIMED_WAITING",
    "biz_thread_97":"TIMED_WAITING",
    "biz_thread_98":"TIMED_WAITING",
    "biz_thread_99":"TIMED_WAITING",
    "biz_thread_100":"TIMED_WAITING",
    "biz_thread_101":"TIMED_WAITING",
    "biz_thread_102":"TIMED_WAITING",
    "biz_thread_103":"WAITING",
    "biz_thread_114":"WAITING",
    "biz_thread_113":"TIMED_WAITING",
    "biz_thread_112":"WAITING",
    "biz_thread_111":"TIMED_WAITING",
    "biz_thread_110":"TIMED_WAITING",
    "biz_thread_109":"TIMED_WAITING",
    "biz_thread_108":"TIMED_WAITING",
    "biz_thread_107":"TIMED_WAITING",
    "biz_thread_106":"TIMED_WAITING",
    "biz_thread_105":"WAITING",
    "biz_thread_104":"TIMED_WAITING",
    "biz_thread_115":"TIMED_WAITING",
    "biz_thread_116":"TIMED_WAITING",
    "biz_thread_122":"TIMED_WAITING",
    "biz_thread_121":"TIMED_WAITING",
    "biz_thread_120":"TIMED_WAITING",
    "biz_thread_119":"TIMED_WAITING",
    "biz_thread_118":"TIMED_WAITING",
    "biz_thread_117":"TIMED_WAITING",
    "biz_thread_126":"TIMED_WAITING",
    "biz_thread_125":"TIMED_WAITING",
    "biz_thread_124":"WAITING",
    "biz_thread_123":"WAITING",
    "biz_thread_127":"WAITING"

Upvotes: 2

Views: 2507

Answers (1)

Allen Xu

Reputation: 143

Finally!

I do not know which Java option or other change helped, but this time there was a full thread dump in the file that nohup redirects stdout to (there was no thread dump the first time the problem occurred).

2019-02-26 19:40:29
Full thread dump Java HotSpot(TM) 64-Bit Server VM (25.192-b12 mixed mode):

"biz_thread_87" #104 prio=5 os_prio=0 tid=0x00007f93e76fd800 nid=0x2b24 waiting for monitor entry [0x00007f938bfbe000]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at org.apache.logging.log4j.core.appender.rolling.RollingFileManager.checkRollover(RollingFileManager.java:217)
    - waiting to lock <0x00000000c06725d8> (a org.apache.logging.log4j.core.appender.rolling.RollingRandomAccessFileManager)
    at org.apache.logging.log4j.core.appender.RollingRandomAccessFileAppender.append(RollingRandomAccessFileAppender.java:207)
    at org.apache.logging.log4j.core.config.AppenderControl.tryCallAppender(AppenderControl.java:156)
    at org.apache.logging.log4j.core.config.AppenderControl.callAppender0(AppenderControl.java:129)
    at org.apache.logging.log4j.core.config.AppenderControl.callAppenderPreventRecursion(AppenderControl.java:120)
    at org.apache.logging.log4j.core.config.AppenderControl.callAppender(AppenderControl.java:84)
    at org.apache.logging.log4j.core.config.LoggerConfig.callAppenders(LoggerConfig.java:448)
    at org.apache.logging.log4j.core.config.LoggerConfig.processLogEvent(LoggerConfig.java:433)
    at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:417)
    at org.apache.logging.log4j.core.config.LoggerConfig.logParent(LoggerConfig.java:439)
    at org.apache.logging.log4j.core.config.LoggerConfig.processLogEvent(LoggerConfig.java:434)
    at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:417)
    at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:403)
    at org.apache.logging.log4j.core.config.AwaitCompletionReliabilityStrategy.log(AwaitCompletionReliabilityStrategy.java:63)
    at org.apache.logging.log4j.core.Logger.logMessage(Logger.java:146)
    at org.apache.logging.log4j.spi.AbstractLogger.logMessageSafely(AbstractLogger.java:2091)
    at org.apache.logging.log4j.spi.AbstractLogger.logMessage(AbstractLogger.java:1999)
    at org.apache.logging.log4j.spi.AbstractLogger.logIfEnabled(AbstractLogger.java:1868)
    at org.apache.logging.slf4j.Log4jLogger.info(Log4jLogger.java:194)
    at com.some.company.task.impl.StartTask.interruptProcess(StartTask.java:17)
    at com.some.company.common.task.AbstractTask.process(AbstractTask.java:88)
    at com.some.company.common.task.TaskManager.exec(TaskManager.java:174)
    at com.some.company.common.task.Context.lambda$resubmit$0(Context.java:426)
    at com.some.company.common.task.Context$$Lambda$97/519533636.run(Unknown Source)
    at com.some.company.common.task.TaskManager.lambda$delayExec$2(TaskManager.java:121)
    at com.some.company.common.task.TaskManager$$Lambda$37/873634936.run(Unknown Source)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

"nioEventLoopGroup-173-1" #315 prio=10 os_prio=0 tid=0x00007f939420e000 nid=0x2ced runnable [0x00007f937b3e8000]
   java.lang.Thread.State: RUNNABLE
    at java.io.RandomAccessFile.writeBytes(Native Method)
    at java.io.RandomAccessFile.write(RandomAccessFile.java:525)
    at org.apache.logging.log4j.core.appender.rolling.RollingRandomAccessFileManager.writeToDestination(RollingRandomAccessFileManager.java:109)
    - eliminated <0x00000000c06725d8> (a org.apache.logging.log4j.core.appender.rolling.RollingRandomAccessFileManager)
    at org.apache.logging.log4j.core.appender.OutputStreamManager.flushBuffer(OutputStreamManager.java:288)
    - eliminated <0x00000000c06725d8> (a org.apache.logging.log4j.core.appender.rolling.RollingRandomAccessFileManager)
    at org.apache.logging.log4j.core.appender.rolling.RollingRandomAccessFileManager.flush(RollingRandomAccessFileManager.java:128)
    - locked <0x00000000c06725d8> (a org.apache.logging.log4j.core.appender.rolling.RollingRandomAccessFileManager)
    at org.apache.logging.log4j.core.appender.AbstractOutputStreamAppender.directEncodeEvent(AbstractOutputStreamAppender.java:179)
    at org.apache.logging.log4j.core.appender.AbstractOutputStreamAppender.tryAppend(AbstractOutputStreamAppender.java:170)
    at org.apache.logging.log4j.core.appender.AbstractOutputStreamAppender.append(AbstractOutputStreamAppender.java:161)
    at org.apache.logging.log4j.core.appender.RollingRandomAccessFileAppender.append(RollingRandomAccessFileAppender.java:218)
    at org.apache.logging.log4j.core.config.AppenderControl.tryCallAppender(AppenderControl.java:156)
    at org.apache.logging.log4j.core.config.AppenderControl.callAppender0(AppenderControl.java:129)
    at org.apache.logging.log4j.core.config.AppenderControl.callAppenderPreventRecursion(AppenderControl.java:120)
    at org.apache.logging.log4j.core.config.AppenderControl.callAppender(AppenderControl.java:84)
    at org.apache.logging.log4j.core.config.LoggerConfig.callAppenders(LoggerConfig.java:448)
    at org.apache.logging.log4j.core.config.LoggerConfig.processLogEvent(LoggerConfig.java:433)
    at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:417)
    at org.apache.logging.log4j.core.config.LoggerConfig.logParent(LoggerConfig.java:439)
    at org.apache.logging.log4j.core.config.LoggerConfig.processLogEvent(LoggerConfig.java:434)
    at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:417)
    at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:403)
    at org.apache.logging.log4j.core.config.AwaitCompletionReliabilityStrategy.log(AwaitCompletionReliabilityStrategy.java:63)
    at org.apache.logging.log4j.core.Logger.logMessage(Logger.java:146)
    at org.apache.logging.log4j.spi.AbstractLogger.logMessageSafely(AbstractLogger.java:2091)
    at org.apache.logging.log4j.spi.AbstractLogger.logMessage(AbstractLogger.java:2005)
    at org.apache.logging.log4j.spi.AbstractLogger.logIfEnabled(AbstractLogger.java:1876)
    at org.apache.logging.slf4j.Log4jLogger.info(Log4jLogger.java:184)
    at com.some.company.UserWifiClient.sendHeartbeat(UserWifiEppClient.java:39)
    at com.some.company.AbstractNettyClient.lambda$scheduleHeartbeat$0(AbstractNettyClient.java:146)
    at com.some.company.AbstractNettyClient$$Lambda$73/2134853806.run(Unknown Source)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:133)
    at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:404)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:495)
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:905)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.lang.Thread.run(Thread.java:748)

There are 101 BLOCKED threads and 94 threads "waiting to lock <0x00000000c06725d8>".

So I think it must be a misuse of the Log4j Logger. In AbstractTask the logger is created like below:

protected final static Logger logger = LoggerFactory.getLogger(AbstractTask.class);

All AbstractTask implementations just use this logger, but a logger created this way introduces contention when multiple threads try to use it. Now I create the logger this way in AbstractTask:

protected Logger logger = LoggerFactory.getLogger(getClass());

Then each concrete task has its own logger object to use and should not face the contention problem.
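
Roughly, the change looks like this in context (a simplified sketch, not the full classes; StartTask is one of the task classes visible in the thread dump above, all other details are omitted):

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public abstract class AbstractTask {
    // before: protected final static Logger logger = LoggerFactory.getLogger(AbstractTask.class);
    // after: each concrete subclass gets a logger bound to its own runtime class
    protected Logger logger = LoggerFactory.getLogger(getClass());

    protected abstract void interruptProcess();
}

class StartTask extends AbstractTask {
    @Override
    protected void interruptProcess() {
        // this now goes through StartTask's own Logger instance,
        // although all loggers still write through the same configured appender
        logger.info("start task step");
    }
}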

See this: Should logger be private static or not

What a big lesson!

By the way, does the JVM automatically dump threads to stdout sometimes? I found the above "Full thread dump ..." text in the file that nohup redirects stdout to.

Was this kind of thread-blocking problem a deadlock? The app recovered about 2 hours later; if it was a deadlock, how could it recover?
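
For reference, a real deadlock can be confirmed from inside the process with the standard ThreadMXBean API; a minimal sketch, e.g. something to call from the periodic status task:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class DeadlockCheck {
    // returns true and prints the involved threads if the current JVM has a monitor/lock deadlock
    public static boolean report() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long[] ids = threads.findDeadlockedThreads();   // null when no deadlock exists
        if (ids == null) {
            return false;
        }
        for (ThreadInfo info : threads.getThreadInfo(ids, true, true)) {
            System.out.println(info);
        }
        return true;
    }
}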

Update (2019-03-11)

Actually, after I modified the way the Log4j logger is created, I ran into this problem again, even after a week of smooth running. I realized that it may have been caused by jmap. I had run the command below last Friday for several hours:

while [ true ]; do jmap -heap 340; echo "=================================================="; sleep 30; done

Then at about 17:00 the app froze again! So if your app is running correctly, run jmap or jstack against its PID as rarely as possible. These commands pause the JVM while they collect data, and when they are executed too frequently the JVM can end up frozen.
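
If periodic heap or thread information is really needed, a lower-impact alternative (a sketch, assuming you can add code to the app) is to collect it from inside the JVM via the java.lang.management API instead of attaching external tools:

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class InProcessDiagnostics {
    // heap usage without attaching jmap to the process
    public static String heapSummary() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        return "heap used=" + heap.getUsed()
                + " committed=" + heap.getCommitted()
                + " max=" + heap.getMax();
    }

    // textual dump of all live threads, including the monitors BLOCKED threads are waiting on
    public static String threadDump() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        StringBuilder sb = new StringBuilder();
        for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
            sb.append(info);   // note: ThreadInfo.toString() truncates each stack to 8 frames
        }
        return sb.toString();
    }
}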

Upvotes: 4
