Arthur Accioly

Reputation: 809

Long garbage collection times dropping network connections but not bouncing the pod in Kubernetes

I have the following scenario:

  1. I have a pod running an application with two sidecar containers.
  2. All three containers hold multiple network connections (around 7 servers are connected to them). They are constantly downloading/sending files or keeping connections open to receive data from other servers, including stock exchange servers.
  3. The Kubernetes version is v1.24.7+rke2r1
  4. The Java version is openjdk_zulu8 (8.0.392-0) - jre1.8.0_362-amd64
  5. The images were built using docker 1.13.1
  6. The Linux image is RHEL 7.9 (Maipo)
  7. We're not using any cloud service, this is our internal K8s environment.
  8. We're using Ceph as storage
  9. The heap memory limits are not specified; we let the JVM decide (see the check sketched just after this list).
  10. Our Prometheus/Grafana graphs show no resource pressure, so we don't believe it's a resource issue, but this is how we're setting up memory and CPU:

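Since we don't pass an explicit -Xmx (item 9 above), a quick way to check what maximum heap the container-aware JVM actually settled on is something like the following. This is only a sketch; <pod> and <container> are placeholders, not our real names:

# Sketch only: <pod> and <container> are placeholders. Prints the max heap
# the JVM derives from the container limits when no -Xmx is set.
kubectl exec <pod> -c <container> -- \
  /usr/java/jre1.8.0_362-amd64/bin/java -XX:+UseContainerSupport \
  -XX:+PrintFlagsFinal -version | grep -iE 'maxheapsize|maxram'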
The problem I'm facing is that, all of a sudden, our applications freeze for garbage collection for several minutes, stopping all the threads. This has happened in all three containers. It ends up dropping all the network connections, since these applications must keep heartbeating to keep the connections alive. The weird thing is that the freeze doesn't actually kill the application, so the pod is not restarted.

We've tried tweaking the Java parameters to show more data about the GC (and to add more resources), but we really can't find the issue (that's why you see all these fancy extra parameters, which we added after seeing the problem). Any help/hint is very welcome.

Main java call parameters:

/usr/java/jre1.8.0_362-amd64/bin/java -showversion -XX:+PrintFlagsFinal -XshowSettings:system \
  -XX:+UseG1GC -XX:+ExplicitGCInvokesConcurrent -XX:+PrintGCDetails -XX:MaxGCPauseMillis=200 \
  -Dsun.rmi.dgc.client.gcInterval=604800000 -Dsun.rmi.dgc.server.gcInterval=604800000 \
  -XX:+SafepointTimeout -XX:SafepointTimeoutDelay=500 \
  -Djava.net.preferIPv4Stack=true -XX:+UseContainerSupport -XX:ActiveProcessorCount=2 \
  -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1 \
  -javaagent:/opt/jhiccup/lib/jhiccup-2.0.10.jar=-d,5000,-i,1000,-s,3,-l
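For comparison, these are the usual JDK 8 flags for making the full stop-the-world time (GC work plus the time threads need to reach the safepoint) visible in a GC log. This is an illustrative set, not exactly what we ended up running, and the log path is a placeholder:

# Illustrative JDK 8 pause-logging flags; /var/log/app/gc.log is a placeholder path.
# PrintGCApplicationStoppedTime reports the whole stopped interval, not just GC work.
-Xloggc:/var/log/app/gc.log -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps \
-XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime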

When the GC issue hits, we start seeing CPU throttling and the sync column in the safepoint statistics climbing:

         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
49335.605: RevokeBias                       [      77          0              0    ]      [     0     0     0     0     0    ]  0
         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
49340.629: RevokeBias                       [      76          0              0    ]      [     0     0 21728     0     0    ]  0
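In this output, sync is the time (in milliseconds) all Java threads needed to reach the safepoint, so the 21728 above means roughly 22 seconds of time-to-safepoint before the VM operation even ran. When the safepoint statistics are redirected to a file, a rough filter for those long events looks like this (it assumes exactly the column layout shown above, and safepoint.log is a placeholder name):

# Print safepoints whose sync time exceeded 1000 ms; with the layout above,
# the sync value is the 11th whitespace-separated field on each data line.
awk '$11+0 > 1000' safepoint.log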

Looking at the output of top, I saw a lot of processes in status D, which is uninterruptible sleep. It looks like something in Ceph is freezing the whole application. Has anyone seen this before?

(Screenshot: top output showing processes in status D)
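To see where those D-state tasks are stuck in the kernel (Ceph or other network-filesystem waits show up clearly here), something along these lines can be run as root on the node hosting the pod; <pid> is a placeholder for one of the stuck PIDs:

# List tasks in uninterruptible sleep together with the kernel function they wait in.
ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'
# Kernel stack of one stuck task (needs root; <pid> is a placeholder).
cat /proc/<pid>/stack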

This is a capture of the safepoint statistics taken right after the freeze happened. Note that the sync column has a very high value:

(Screenshot: safepoint statistics showing very high sync values)

Upvotes: 2

Views: 384

Answers (1)

Arthur Accioly

Reputation: 809

We discovered that the problem was related to the storage. Once we changed from Ceph to CephRBD, the problem went away.
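For anyone hitting the same symptom, a quick way to confirm which Ceph flavour a pod's volumes are actually on is to look at the claim's StorageClass and its provisioner; <pvc-name> and <namespace> below are placeholders:

# Show the StorageClass used by the claim (<pvc-name>/<namespace> are placeholders).
kubectl get pvc <pvc-name> -n <namespace> -o jsonpath='{.spec.storageClassName}{"\n"}'
# Map each StorageClass to its provisioner (e.g. CephFS- vs RBD-backed).
kubectl get storageclass -o custom-columns=NAME:.metadata.name,PROVISIONER:.provisioner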

Upvotes: 1
