Reputation: 809
I have the following scenario:
Main container:

limits:
  cpu: 3000m
  memory: 10Gi
requests:
  cpu: 3000m
  memory: 10Gi

Sidecar container 1:

limits:
  cpu: 3000m
  memory: 4400Mi
requests:
  cpu: 2000m
  memory: 4000Mi

Sidecar container 2:

limits:
  cpu: 3000m
  memory: 4400Mi
requests:
  cpu: 2000m
  memory: 4000Mi
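One quick cross-check against these requests and limits is to print what a JVM actually detects from inside each container. A minimal sketch (the class name is mine, not part of our application), runnable with the same flags as the main process:

public class ContainerResourceCheck {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // With -XX:+UseContainerSupport this reflects the cgroup CPU limit,
        // unless it is overridden by -XX:ActiveProcessorCount.
        System.out.println("availableProcessors = " + rt.availableProcessors());
        // Maximum heap the JVM will grow to; it should leave headroom below the
        // container memory limit for metaspace, thread stacks and native memory.
        System.out.println("maxMemory (MiB)     = " + rt.maxMemory() / (1024 * 1024));
    }
}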
The problem we're facing is that, all of a sudden, our applications freeze for garbage collection for several minutes, stopping all the threads. This has happened in all three containers. It ends up dropping all the network connections, since these applications must heartbeat to keep the connections alive. The weird thing is that this doesn't actually stop the application, so the pod is not restarted.
We've tried tweaking the Java parameters to show more data about the GC (and added more resources), but we really can't find the issue (that's why you see all those fancy extra parameters, which we added after seeing the problem). Any help/hint is very welcome.
Main java call parameters:
/usr/java/jre1.8.0_362-amd64/bin/java -showversion -XX:+PrintFlagsFinal -XshowSettings:system -XX:+UseG1GC -XX:+ExplicitGCInvokesConcurrent -XX:+PrintGCDetails -XX:MaxGCPauseMillis=200 -Dsun.rmi.dgc.client.gcInterval=604800000 -Dsun.rmi.dgc.server.gcInterval=604800000 -XX:+SafepointTimeout -XX:SafepointTimeoutDelay=500 -Djava.net.preferIPv4Stack=true -XX:+UseContainerSupport -XX:ActiveProcessorCount=2 -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1 -javaagent:/opt/jhiccup/lib/jhiccup-2.0.10.jar=-d,5000,-i,1000,-s,3,-l
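As an independent cross-check of the jHiccup agent in that command line, the same idea can be reproduced in a few lines: a watchdog thread sleeps for a fixed interval and reports whenever it wakes up far later than expected, which catches GC pauses as well as non-GC safepoint stalls like the RevokeBias one below. A minimal sketch (class name and thresholds are mine, not the jHiccup API):

public class StallDetector implements Runnable {
    private static final long INTERVAL_MS = 1000;   // expected sleep time
    private static final long THRESHOLD_MS = 2000;  // report stalls longer than this

    @Override
    public void run() {
        long before = System.nanoTime();
        while (!Thread.currentThread().isInterrupted()) {
            try {
                Thread.sleep(INTERVAL_MS);
            } catch (InterruptedException e) {
                return;
            }
            long after = System.nanoTime();
            long elapsedMs = (after - before) / 1_000_000;
            if (elapsedMs - INTERVAL_MS > THRESHOLD_MS) {
                System.err.println("Stall detected: expected to sleep " + INTERVAL_MS
                        + " ms, woke up after " + elapsedMs + " ms");
            }
            before = after;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread t = new Thread(new StallDetector(), "stall-detector");
        t.setDaemon(true);
        t.start();
        Thread.sleep(Long.MAX_VALUE); // keep the JVM alive for the demo
    }
}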
When the GC issue hits, we start seeing CPU throttling and the sync column climbing:
           vmop          [threads: total initially_running wait_to_block] [time: spin block  sync cleanup vmop] page_trap_count
49335.605: RevokeBias    [      77          0              0            ] [      0     0       0     0     0 ]               0
49340.629: RevokeBias    [      76          0              0            ] [      0     0   21728     0     0 ]               0
Looking at the top command, I saw a lot of processes in status D, i.e. uninterruptible sleep. It looks like something in Ceph is freezing the whole application. Has anyone seen this before?
The safepoint statistics above were printed right after the freeze happened; note the very high value in the sync column.
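To correlate those sync spikes with the CPU throttling and the D-state processes seen in top, the same data can be pulled from /proc and the cgroup filesystem from inside the container. A rough sketch, assuming a cgroup v1 mount at /sys/fs/cgroup/cpu (on cgroup v2 the counters live in /sys/fs/cgroup/cpu.stat instead); the class name is mine:

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class ThrottleAndDStateCheck {
    public static void main(String[] args) throws IOException {
        // CFS throttling counters: nr_periods, nr_throttled, throttled_time.
        String cpuStat = "/sys/fs/cgroup/cpu/cpu.stat"; // assumed cgroup v1 path
        if (new File(cpuStat).exists()) {
            for (String line : Files.readAllLines(Paths.get(cpuStat))) {
                System.out.println("cpu.stat: " + line);
            }
        } else {
            System.out.println("cpu.stat not found at " + cpuStat + " (cgroup v2?)");
        }

        // Count this JVM's own threads in uninterruptible sleep (state "D").
        int dState = 0;
        File[] tasks = new File("/proc/self/task").listFiles();
        if (tasks != null) {
            for (File task : tasks) {
                try {
                    List<String> stat = Files.readAllLines(Paths.get(task.getPath(), "stat"));
                    if (stat.isEmpty()) {
                        continue;
                    }
                    // /proc/<tid>/stat looks like "tid (comm) state ..."; the state
                    // is the first field after the closing parenthesis of the name.
                    String line = stat.get(0);
                    String afterComm = line.substring(line.lastIndexOf(')') + 1).trim();
                    if (afterComm.startsWith("D")) {
                        dState++;
                    }
                } catch (IOException e) {
                    // thread exited between listing and reading; skip it
                }
            }
        }
        System.out.println("threads in D state: " + dState);
    }
}

Threads stuck in D state are waiting on the kernel (typically I/O), which would be consistent with storage stalling the process while the JVM is trying to bring threads to a safepoint.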
Upvotes: 2
Views: 384
Reputation: 809
We discovered that the problem was related to storage. When we changed from Ceph to CephRBD, the problem went away.
Upvotes: 1