Reputation: 5750
Certain pods on my cluster are extremely slow in almost all aspects. Startup time, network, i/o.
I have minimized the application code in these containers and it seems to have no effect, these are basically minimal containers running a simple webapi with a health check endpoint.
I'm wondering if someone can help me figure out what's wrong or how to debug this.
When I say slow in all aspects, I mean a couple of things:
Very slow startup. I actually have to change my readiness probe's initial delay to nearly 5 minutes (see the probe snippet after this list).
Inside the container, running any command is slow. An apt-get update takes nearly 5 minutes, even if the container has been running for hours.
Any connection to an RDS database will time out for at least the first 10 minutes the pod is running; after that it's hit or miss: sometimes normal speed, sometimes we start getting connection timeouts again (mainly if the pod hasn't been used/requested for a while).
On nearly identical pods with the same base image, the container will start in less than a couple of seconds, and an apt-get update will take maybe 3 seconds. I cannot for the life of me see what is different between the pods that makes some of them 'good pods' and others 'bad pods'.
Running any of these images locally, they start in no time (a second or less).
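For reference, the readiness-probe workaround above looks roughly like this; the path and port are illustrative placeholders, the real point is the initialDelaySeconds pushed out to about 5 minutes:

```yaml
readinessProbe:
  httpGet:
    path: /health        # illustrative health check endpoint
    port: 8080           # illustrative port
  initialDelaySeconds: 300   # ~5 minutes, just so the pod ever becomes ready
  periodSeconds: 10
```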
Cluster (AWS)
Things I've checked/tried
too many pods
My first thought was that maybe I'm running too many pods. I launched brand-new nodes for this (c4.xlarge) and had this pod as the only pod running in the cluster; the issue was still seen.
node resources
Checking every node-level metric I could, nothing looks out of the ordinary (I also tried on several brand-new, fairly high-powered nodes).
Deployment/Pod Metrics
I'm happy to share whatever metric anyone can think of here; nothing looks out of the norm. I have Prometheus running and have looked into every metric I could think to check. I can't see a difference between a 'good' running pod and a 'bad' one.
cluster itself
I actually have 2 clusters, both provisioned with kops, and this is seen on both clusters (though not always with the same applications, which is odd).
Any help here is appreciated
Upvotes: 18
Views: 36450
Reputation: 3484
This is likely happening either because the configured resource limits are too constrained, or because missing resource requests allow pods to be scheduled onto nodes that do not have the capacity to run their workloads.
You can resolve this by defining proper resource requests for each of the applications you deploy to Kubernetes. In a nutshell, you can control requests and limits for shares of CPU time, bytes of memory, and Linux HugePages.
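As a rough illustration (the deployment name, image, and the specific CPU/memory values below are placeholders, not taken from the question), requests and limits are set per container in the pod spec:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapi                 # hypothetical name, adjust to your deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: webapi
  template:
    metadata:
      labels:
        app: webapi
    spec:
      containers:
      - name: webapi
        image: your-registry/webapi:latest   # placeholder image
        resources:
          requests:            # what the scheduler reserves on the node
            cpu: "500m"
            memory: "256Mi"
          limits:              # hard caps enforced at runtime
            cpu: "1"
            memory: "512Mi"
```

The requests are what the scheduler uses to place the pod on a node with enough spare capacity, while the limits are hard caps; an overly tight CPU limit will throttle the container, which can look exactly like the across-the-board slowness described in the question.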
Upvotes: 7