Arthur Accioly

Reputation: 809

Long garbage collection times dropping network connections but not bouncing the pod in Kubernetes

I have the following scenario:

  1. I have a pod running an application with two sidecar containers.
  2. All three containers hold multiple network connections (around 7 servers are connected to them). They are constantly downloading/sending files or keeping connections open to receive data from other servers, including stock exchange servers.
  3. The Kubernetes version is v1.24.7+rke2r1
  4. The Java version is openjdk_zulu8 (8.0.392-0) - jre1.8.0_362-amd64
  5. The images were built using docker 1.13.1
  6. The Linux image is RHEL 7.9 (Maipo)
  7. We're not using any cloud service, this is our internal K8s environment.
  8. We're using Ceph as storage
  9. The heap memory limits are not specified; we let the JVM decide (see the check sketched just after this list).
  10. Our Prometheus/Grafana graphs show no resource pressure, so we don't believe it's a resource issue, but this is how we're setting up memory and CPU:

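Since we don't pass an explicit -Xmx (item 9 above), a quick way to check what maximum heap the container-aware JVM actually settled on is something like the following. This is only a sketch; <pod> and <container> are placeholders, not our real names:

# Sketch only: <pod> and <container> are placeholders. Prints the max heap
# the JVM derives from the container limits when no -Xmx is set.
kubectl exec <pod> -c <container> -- \
  /usr/java/jre1.8.0_362-amd64/bin/java -XX:+UseContainerSupport \
  -XX:+PrintFlagsFinal -version | grep -iE 'maxheapsize|maxram'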
The problem I'm facing is that, all of a sudden, our applications freeze for garbage collection for several minutes, stopping all the threads. This has happened in all three containers. It ends up dropping all the network connections, since these applications must keep heartbeating to keep the connections alive. The weird thing is that the freeze doesn't actually kill the application, so the pod is not restarted.

We've tried tweaking the Java parameters to show more data about the GC (and to add more resources), but we really can't find the issue (that's why you see all these fancy extra parameters, which we added after seeing the problem). Any help/hint is very welcome.

Main java call parameters:

/usr/java/jre1.8.0_362-amd64/bin/java -showversion -XX:+PrintFlagsFinal -XshowSettings:system \
  -XX:+UseG1GC -XX:+ExplicitGCInvokesConcurrent -XX:+PrintGCDetails -XX:MaxGCPauseMillis=200 \
  -Dsun.rmi.dgc.client.gcInterval=604800000 -Dsun.rmi.dgc.server.gcInterval=604800000 \
  -XX:+SafepointTimeout -XX:SafepointTimeoutDelay=500 \
  -Djava.net.preferIPv4Stack=true -XX:+UseContainerSupport -XX:ActiveProcessorCount=2 \
  -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1 \
  -javaagent:/opt/jhiccup/lib/jhiccup-2.0.10.jar=-d,5000,-i,1000,-s,3,-l
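For comparison, these are the usual JDK 8 flags for making the full stop-the-world time (GC work plus the time threads need to reach the safepoint) visible in a GC log. This is an illustrative set, not exactly what we ended up running, and the log path is a placeholder:

# Illustrative JDK 8 pause-logging flags; /var/log/app/gc.log is a placeholder path.
# PrintGCApplicationStoppedTime reports the whole stopped interval, not just GC work.
-Xloggc:/var/log/app/gc.log -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps \
-XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime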

When the GC issue hits, we start seeing CPU throttling and the sync column in the safepoint statistics climbing:

         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
49335.605: RevokeBias                       [      77          0              0    ]      [     0     0     0     0     0    ]  0
         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
49340.629: RevokeBias                       [      76          0              0    ]      [     0     0 21728     0     0    ]  0
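In this output, sync is the time (in milliseconds) all Java threads needed to reach the safepoint, so the 21728 above means roughly 22 seconds of time-to-safepoint before the VM operation even ran. When the safepoint statistics are redirected to a file, a rough filter for those long events looks like this (it assumes exactly the column layout shown above, and safepoint.log is a placeholder name):

# Print safepoints whose sync time exceeded 1000 ms; with the layout above,
# the sync value is the 11th whitespace-separated field on each data line.
awk '$11+0 > 1000' safepoint.log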

Looking at the output of top, I saw a lot of processes in status D, which is uninterruptible sleep. It looks like something in Ceph is freezing the whole application. Has anyone seen this before?

(Screenshot: top output showing processes in status D)
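To see where those D-state tasks are stuck in the kernel (Ceph or other network-filesystem waits show up clearly here), something along these lines can be run as root on the node hosting the pod; <pid> is a placeholder for one of the stuck PIDs:

# List tasks in uninterruptible sleep together with the kernel function they wait in.
ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'
# Kernel stack of one stuck task (needs root; <pid> is a placeholder).
cat /proc/<pid>/stack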

This is a capture of the safepoint statistics taken right after the freeze happened. Note that the sync column has a very high value:

(Screenshot: safepoint statistics showing very high sync values)

Upvotes: 2

Views: 384

Answers (1)

Arthur Accioly

Reputation: 809

We discovered that the problem was related to the storage. Once we changed from Ceph to CephRBD, the problem went away.
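For anyone hitting the same symptom, a quick way to confirm which Ceph flavour a pod's volumes are actually on is to look at the claim's StorageClass and its provisioner; <pvc-name> and <namespace> below are placeholders:

# Show the StorageClass used by the claim (<pvc-name>/<namespace> are placeholders).
kubectl get pvc <pvc-name> -n <namespace> -o jsonpath='{.spec.storageClassName}{"\n"}'
# Map each StorageClass to its provisioner (e.g. CephFS- vs RBD-backed).
kubectl get storageclass -o custom-columns=NAME:.metadata.name,PROVISIONER:.provisioner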

Upvotes: 1
