Reputation: 55
I have been experiencing a problem running an Autopilot GKE cluster. The problem prevents pods from running, so it's a little bit frustrating.
My configuration currently consists of only two workloads. One deployment is configured to give its pods 1 CPU and 2 GiB of RAM, but I constantly receive an error saying there is not enough CPU. In the pod's events tab I can also see the error "GCE quota exceeded", but without any details about which quota I have exceeded. I've configured the deployment to scale to only 1 pod.
I also looked at my quota usage (IAM > Quotas), and the only quota with more than 50% usage is Persistent Disk SSD, but the error I'm receiving indicates that CPUs are not available.
This is a screenshot from the events tab:
This is a screenshot of my quota usage:
This is really confusing me. I don't understand why I'm receiving a quota error when I clearly have enough room to run 1 more CPU. I have also checked the minimum and maximum resource requests for my class:
If that table is correct, I should be within the boundaries: 250 mCPU < 1 CPU < 30 CPU and 512 MB < 2 GB < 110 GB. I really don't understand why GKE is not running my pod...
I have tried and investigated a lot, including other threads, but I'm not able to find anything. Hopefully someone has experienced the same problem and has successfully solved it :)
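For reference, the bounds check described here can be sketched in a few lines of Python. This is only an illustration: the helper names (`parse_cpu`, `parse_memory`, `within_bounds`) are hypothetical, the limits are the ones quoted in this post, and treating them as binary units (Mi/Gi) is an assumption.

```python
from decimal import Decimal

# Suffixes used by Kubernetes memory quantities (binary first, then decimal).
_MEMORY_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}

def parse_cpu(quantity: str) -> Decimal:
    """Return a CPU quantity such as '250m' or '1' in whole cores."""
    if quantity.endswith("m"):
        return Decimal(quantity[:-1]) / 1000
    return Decimal(quantity)

def parse_memory(quantity: str) -> Decimal:
    """Return a memory quantity such as '512Mi' or '2Gi' in bytes."""
    for suffix, factor in _MEMORY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return Decimal(quantity[: -len(suffix)]) * factor
    return Decimal(quantity)  # plain number of bytes

# Limits quoted in the post; interpreting them as Mi/Gi is an assumption.
CPU_MIN, CPU_MAX = "250m", "30"
MEM_MIN, MEM_MAX = "512Mi", "110Gi"

def within_bounds(cpu: str, memory: str) -> bool:
    """Check a CPU/memory request against the quoted minimum and maximum."""
    cpu_ok = parse_cpu(CPU_MIN) <= parse_cpu(cpu) <= parse_cpu(CPU_MAX)
    mem_ok = parse_memory(MEM_MIN) <= parse_memory(memory) <= parse_memory(MEM_MAX)
    return cpu_ok and mem_ok

print(within_bounds("1", "2Gi"))  # the request from this deployment
```

A request of 1 CPU and 2Gi passes this check, which is consistent with the request size itself not being the cause of the scheduling failure.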
Upvotes: 1
Views: 110
Reputation: 367
The error you are encountering, "GCE out of resources. Pod is at risk of not being scheduled", indicates that your GKE Autopilot cluster is unable to allocate the resources (CPU, memory, etc.) needed to schedule your pod.
As per this GCP Autopilot troubleshooting document:
To resolve this issue, you can try the following:
Deploy the Pod in a different region or zone. If your Pod has a zonal restriction such as a topology selector, remove the restriction if you can. For instructions, see Place GKE Pods in specific zones.
Create a cluster in a different region and retry the deployment.
Try using a different compute class. Compute classes that are backed by smaller Compute Engine machine types are more likely to have available resources. For example, the default machine type for Autopilot has the highest availability. For a list of compute classes and the corresponding machine types, see When to use specific compute classes.
If you run GPU workloads, the requested GPU might not be available in your node location. Try deploying your workload in a different location or requesting a different type of GPU.
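To check whether a Pod carries a zonal restriction like the one mentioned in the first suggestion, something along these lines may help. It is a minimal sketch: the function name `zonal_restrictions` and the sample spec are hypothetical, and it only looks at `nodeSelector` and required node affinity on the well-known `topology.kubernetes.io/zone` label, with the Pod spec already loaded as a Python dict.

```python
ZONE_KEY = "topology.kubernetes.io/zone"  # well-known Kubernetes zone label

def zonal_restrictions(pod_spec: dict) -> list[str]:
    """List the places where a Pod spec pins itself to specific zones."""
    found = []

    # Case 1: a plain nodeSelector on the zone label.
    node_selector = pod_spec.get("nodeSelector") or {}
    if ZONE_KEY in node_selector:
        found.append(f"nodeSelector pins zone {node_selector[ZONE_KEY]}")

    # Case 2: a required node affinity term matching on the zone label.
    node_affinity = (pod_spec.get("affinity") or {}).get("nodeAffinity") or {}
    required = node_affinity.get("requiredDuringSchedulingIgnoredDuringExecution") or {}
    for term in required.get("nodeSelectorTerms", []):
        for expr in term.get("matchExpressions", []):
            if expr.get("key") == ZONE_KEY:
                found.append(f"nodeAffinity restricts zones to {expr.get('values', [])}")
    return found

# Hypothetical Pod spec (the "spec.template.spec" part of a Deployment).
sample_spec = {
    "nodeSelector": {ZONE_KEY: "us-central1-a"},
    "containers": [{"name": "app", "image": "example/app:latest"}],
}
for line in zonal_restrictions(sample_spec):
    print(line)
```

An empty result means neither of these two mechanisms is restricting the zone; other constraints are not covered by this sketch.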
You can also request a higher quota value by following the "request a higher quota value" guide, and check this document, which may help to resolve the issue.
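Before requesting more quota, it can help to see which regional quotas are actually close to their limit. The sketch below assumes the JSON produced by `gcloud compute regions describe REGION --format=json`, where each entry under `quotas` has `metric`, `limit` and `usage` fields; the function name `quotas_over` and the sample numbers are made up for illustration.

```python
import json

def quotas_over(region: dict, threshold: float = 0.5) -> list[tuple[str, float]]:
    """Return (metric, usage ratio) pairs at or above the threshold, highest first."""
    hits = []
    for quota in region.get("quotas", []):
        limit = quota.get("limit", 0)
        if limit > 0:  # skip quotas with no usable limit
            ratio = quota.get("usage", 0) / limit
            if ratio >= threshold:
                hits.append((quota["metric"], ratio))
    return sorted(hits, key=lambda item: item[1], reverse=True)

# Hypothetical sample shaped like the gcloud output; the values are invented.
sample = json.loads("""
{
  "quotas": [
    {"metric": "CPUS", "limit": 24.0, "usage": 2.0},
    {"metric": "SSD_TOTAL_GB", "limit": 500.0, "usage": 300.0},
    {"metric": "IN_USE_ADDRESSES", "limit": 8.0, "usage": 1.0}
  ]
}
""")

for metric, ratio in quotas_over(sample):
    print(f"{metric}: {ratio:.0%} used")
```

With real data, replace `sample` with the parsed output of the gcloud command for the cluster's region.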
Edit :
Clean up the cluster by manually deleting any resources that are stuck or in an inconsistent state. If that does not help, try creating a new cluster, which might resolve the issue.
Upvotes: 0