Naman

Unusual spike in thread counts every midnight in a Spring Boot application

Application Context: We are running an application based on spring-boot:3.2.0 that exposes REST APIs to clients via spring-boot-starter-web, integrates with our datastore via org.springframework.boot:spring-boot-starter-data-elasticsearch, adds an in-memory cache based on com.github.ben-manes.caffeine:caffeine, and uses org.springframework.kafka:spring-kafka.

Issue: Every midnight, exactly at 00:00, the thread count of our application spikes, and afterwards very few of those threads are terminated. This leaves the application with incrementally more threads each day and sometimes results in pod restarts (we suspect OOM, since memory usage increases as well).

[Image: thread count graph showing the spike at midnight]

Observation: During this time, there is little deviation in the throughput pattern of the REST requests or of the message-consuming pipelines.


Attempt: We analyzed thread dumps of the application and compared dumps taken on two different days. One of the major aspects visible was a significant number of threads in the TIMED_WAITING state.

[Image: thread dump comparison showing a large number of TIMED_WAITING threads]

The stack trace for these threads looks like:

"http-nio-8080-exec-61" #37660 [37664] daemon prio=5 os_prio=0 cpu=38462.67ms elapsed=46889.40s tid=0x00007f2e0c0557b0 nid=37664 waiting on condition [0x00007f2ddf2fe000]
java.lang.Thread.State: TIMED_WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base@21/Native Method)
- parking to wait for <0x0000000684f26058> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.parkNanos(java.base@21/LockSupport.java:269)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(java.base@21/AbstractQueuedSynchronizer.java:1758)
at java.util.concurrent.LinkedBlockingQueue.poll(java.base@21/LinkedBlockingQueue.java:460)
at org.apache.tomcat.util.threads.TaskQueue.poll(TaskQueue.java:99)
at org.apache.tomcat.util.threads.TaskQueue.poll(TaskQueue.java:33)
at org.apache.tomcat.util.threads.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1113)
at org.apache.tomcat.util.threads.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1176)
at org.apache.tomcat.util.threads.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:659)
at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
at java.lang.Thread.runWith(java.base@21/Thread.java:1596)
at java.lang.Thread.run(java.base@21/Thread.java:1583)
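
For context, this frame pattern (LinkedBlockingQueue.poll via Tomcat's TaskQueue.poll inside ThreadPoolExecutor.getTask) is what an idle Tomcat worker looks like while it waits, with a timeout, for its next task. How many such workers can exist is bounded by the pool-sizing properties; the values below are Spring Boot's defaults, shown for reference only, not our configuration:

# Tomcat worker pool bounds (Spring Boot defaults, shown for reference only)
server.tomcat.threads.max=200
server.tomcat.threads.min-spare=10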

Help Wanted: From here, we are lost as to which direction to look in. We would appreciate hints from the community on:

  1. Could one of the libraries mentioned have some default configuration that performs an internal task exactly at 00:00?
  2. Based on the stack trace in the dump, how do we figure out which code created the object that so many threads are waiting on? (additional info)
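
For question 2, one way to see what a parked thread is waiting on, and whether any thread currently owns that lock, is a thread dump taken with lock details (for example via jcmd's Thread.print). The same information is available programmatically through ThreadMXBean; the class below is a minimal sketch for illustration, not code from the application:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Minimal sketch (illustrative only): list TIMED_WAITING threads together with
// the object they are parked on and the thread owning that lock, if any.
public class ParkedThreadProbe {

    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // dumpAllThreads(true, true) includes locked monitors and ownable synchronizers
        for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
            if (info.getThreadState() == Thread.State.TIMED_WAITING) {
                System.out.printf("%s is waiting on %s (owner: %s)%n",
                        info.getThreadName(),
                        info.getLockName(),        // e.g. the ConditionObject seen in the dump
                        info.getLockOwnerName());  // null when no thread owns the lock
            }
        }
    }
}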

Note: I can share the thread dump if that helps, along with further information as required. As mentioned, this happens every single night with our application.


Update: The web and Tomcat properties overridden in the application's setup are:

spring.servlet.multipart.max-file-size=10MB
spring.servlet.multipart.max-request-size=10MB
server.servlet.context-path=/places/api

server.tomcat.accesslog.enabled=true
server.tomcat.accesslog.suffix=.log
server.tomcat.accesslog.prefix=access_log
server.tomcat.accesslog.file-date-format=.yyyy-MM-dd
server.tomcat.accesslog.pattern=%h %l %t "%r" %s %b "%{Referer}i" "%{User-Agent}i" %D
server.tomcat.accesslog.rotate=true
server.tomcat.basedir=/var/log
server.tomcat.accesslog.directory=places
server.tomcat.accesslog.max-days=7



Answers (1)

Naman

First Phase:

Thanks to a user's comment, I started investigating the non-default configurations in the application. In our case, we were dealing with the following:

server.tomcat.accesslog.file-date-format=.yyyy-MM-dd
server.tomcat.accesslog.rotate=true

which meant that every day, exactly at 00:00, our access log files were rotated to a new file name. This was leaving multiple Tomcat threads in the TIMED_WAITING state.

After changing the rotation interval from a day to a month, the impact was visible: the thread count increase at midnight was no longer evident. Still, some threads ended up in the TIMED_WAITING state at odd intervals.
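
For reference, Tomcat rolls the access log whenever the formatted value of file-date-format changes, so switching to a monthly pattern is one way to express the change described above (the values are illustrative, not necessarily our exact configuration):

server.tomcat.accesslog.rotate=true
# The log now rolls when the month in the formatted date changes, not at every midnight
server.tomcat.accesslog.file-date-format=.yyyy-MM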


Second Phase:

Upon further inspection of the application's thread dumps, logging still looked suspect. We identified that we were deploying the application across multiple pods (three in our case) on a K8s cluster over GCP, with each pod logging to the same access log file on a mounted disk. I suspect this contention is what pushed enough threads into a waiting state to catch our eye.

I validated this by adding a pod-based prefix to the log files, including app logs, access logs, and GC logs. That has brought the application back to normal execution over multiple days.
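
As an illustration of the per-pod prefix, one option is to lean on the HOSTNAME environment variable, which Kubernetes sets to the pod name by default and which Spring's property placeholders can resolve (this is a sketch, not our exact configuration):

# Each pod writes its own access log file instead of all pods sharing one on the mounted disk
server.tomcat.accesslog.prefix=access_log_${HOSTNAME}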

[Image: thread count over several days after the fix, back to a steady pattern]

Note:

(1) I have filed a bug on the Tomcat bug tracker, since searching does not turn up any existing report of this issue or any reference stating that such practices should be avoided.

(2) I also suspect the impact of this misconfiguration could have been much worse had the deployed application been using virtual threads.

