Application Context: We are deploying an application based on spring-boot:3.2.0 that exposes REST APIs to clients via spring-boot-starter-web, integrates with our datastore via org.springframework.boot:spring-boot-starter-data-elasticsearch, adds an in-memory cache based on com.github.ben-manes.caffeine:caffeine, and consumes/produces messages with org.springframework.kafka:spring-kafka.
Issue: Every night at exactly 00:00 the thread count of our application spikes, and very few of those threads are terminated afterwards, leaving the application with an incrementally higher thread count every day and sometimes resulting in pod restarts (we suspect OOM kills, since memory usage increases alongside the thread count).
Observation: During this window there is little deviation in the throughput of the REST requests or of the message-consuming pipelines.
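As a side note, the daily growth could also be confirmed continuously from metrics. A minimal sketch, assuming spring-boot-starter-actuator were added to the application (it is not part of the stack listed above): exposing the metrics and threaddump endpoints so that jvm.threads.live and jvm.threads.states can be charted across midnight.
management.endpoints.web.exposure.include=metrics,threaddump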
Attempt: We analyzed thread dumps of our application, comparing dumps taken on two different days. The most prominent finding was a significant number of threads in the TIMED_WAITING state. The stack trace for these threads looks like:
"http-nio-8080-exec-61" #37660 [37664] daemon prio=5 os_prio=0 cpu=38462.67ms elapsed=46889.40s tid=0x00007f2e0c0557b0 nid=37664 waiting on condition [0x00007f2ddf2fe000]
java.lang.Thread.State: TIMED_WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base@21/Native Method)
- parking to wait for <0x0000000684f26058> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.parkNanos(java.base@21/LockSupport.java:269)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(java.base@21/AbstractQueuedSynchronizer.java:1758)
at java.util.concurrent.LinkedBlockingQueue.poll(java.base@21/LinkedBlockingQueue.java:460)
at org.apache.tomcat.util.threads.TaskQueue.poll(TaskQueue.java:99)
at org.apache.tomcat.util.threads.TaskQueue.poll(TaskQueue.java:33)
at org.apache.tomcat.util.threads.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1113)
at org.apache.tomcat.util.threads.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1176)
at org.apache.tomcat.util.threads.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:659)
at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
at java.lang.Thread.runWith(java.base@21/Thread.java:1596)
at java.lang.Thread.run(java.base@21/Thread.java:1583)
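For context, these http-nio-8080-exec-* threads are Tomcat request-pool workers parked in TaskQueue.poll while waiting for new work, so TIMED_WAITING itself is expected; the anomaly is how many of them accumulate. Below is a minimal sketch of a hypothetical helper (ThreadCountLogger is not part of our codebase) that periodically records the live thread count and the names of TIMED_WAITING threads, to help attribute the midnight spike to a specific pool:
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class ThreadCountLogger {

    // Hypothetical helper: samples JVM thread state every 5 minutes.
    public static void start() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "thread-count-logger");
            t.setDaemon(true);
            return t;
        });
        scheduler.scheduleAtFixedRate(() -> {
            // Overall counts: comparing these across midnight shows the spike.
            System.out.printf("live=%d peak=%d%n",
                    threads.getThreadCount(), threads.getPeakThreadCount());
            // Names of TIMED_WAITING threads reveal which pool they belong to.
            for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
                if (info.getThreadState() == Thread.State.TIMED_WAITING) {
                    System.out.println("  TIMED_WAITING: " + info.getThreadName());
                }
            }
        }, 0, 5, TimeUnit.MINUTES);
    }
}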
Help Wanted: From here we are lost as to which direction to investigate, and are looking forward to some hints from the community on where to look next.
Note: I can share the thread dump if that helps, and further information as and when required. As mentioned, this happens literally every night with our application.
Update: The web and Tomcat properties overridden in the application's setup include:
spring.servlet.multipart.max-file-size=10MB
spring.servlet.multipart.max-request-size=10MB
server.servlet.context-path=/places/api
server.tomcat.accesslog.enabled=true
server.tomcat.accesslog.suffix=.log
server.tomcat.accesslog.prefix=access_log
server.tomcat.accesslog.file-date-format=.yyyy-MM-dd
server.tomcat.accesslog.pattern=%h %l %t "%r" %s %b "%{Referer}i" "%{User-Agent}i" %D
server.tomcat.accesslog.rotate=true
server.tomcat.basedir=/var/log
server.tomcat.accesslog.directory=places
server.tomcat.accesslog.max-days=7
Answer:
Thanks to a comment that led me to investigate the non-default configuration in the application. In our case, the relevant settings were:
server.tomcat.accesslog.file-date-format=.yyyy-MM-dd
server.tomcat.accesslog.rotate=true
which meant that every day at exactly 00:00 our access log was rotated to a new file name. This rotation was leaving multiple Tomcat threads in the TIMED_WAITING state.
After changing the rotation period from a day to a month, the impact was perceptible: the thread-count increase at midnight was no longer evident. However, some threads still ended up in the TIMED_WAITING state at odd intervals.
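For reference, a sketch of that change, assuming (as Tomcat's AccessLogValve does) that the rotation interval follows the granularity of the date pattern:
server.tomcat.accesslog.file-date-format=.yyyy-MM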
Upon further inspection of our application's thread dumps, we still suspected the logging. We identified that we were deploying the application across multiple pods (three in our case) in a K8s cluster on GCP, with each pod writing to the same access log file on a mounted disk. I suspect this contention is what pushed enough threads into the waiting state to catch our eye.
I validated this by adding a pod-based prefix to the log files, covering app logs, access logs, and GC logs. That has restored normal execution of our application over multiple days.
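A minimal sketch of the access-log part of that change, assuming the pod name is available through the HOSTNAME environment variable (the Kubernetes default) and is resolved via Spring's property placeholders:
server.tomcat.accesslog.prefix=access_log_${HOSTNAME}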
Note:
(1) I have filed a bug on the Tomcat bug tracker, since searching did not turn up any existing report of this issue or any reference stating that such a setup should be avoided.
(2) I also suspect that the impact of this misconfiguration would have been considerably worse had the application been using virtual threads (see the property sketch below).
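For completeness, virtual threads in Spring Boot 3.2 are opt-in via the following property, which we are not using:
spring.threads.virtual.enabled=true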