Patrick

Reputation: 2719

Unable to request cluster with one GPU on GKE

I'm trying to create a minimal cluster with 1 node and 1 GPU/node. My command:

gcloud container clusters create cluster-gpu \
    --num-nodes=1 \
    --zone=us-central1-a \
    --machine-type="n1-highmem-2" \
    --accelerator="type=nvidia-tesla-k80,count=1" \
    --scopes="gke-default,storage-rw"

creates the cluster. Now when the following pod.yaml

apiVersion: v1
kind: Pod
metadata:
  name: gke-training-pod-gpu
spec:
  containers:
  - name: my-custom-container
    image: gcr.io/.../object-classification:gpu
    resources:
      limits:
        nvidia.com/gpu: 1

is applied to my cluster, I can see in the GKE dashboard that the gke-training-pod-gpu pod is never created. When I do the same as above, only replacing --num-nodes=1 with --num-nodes=2, I get the following error instead:

ERROR: (gcloud.container.clusters.create) ResponseError: code=403, message=Insufficient regional quota to satisfy request: resource "NVIDIA_K80_GPUS": request requires '2.0' and is short '1.0'. project has a quota of '1.0' with '1.0' available. View and manage quotas at https://console.cloud.google.com/iam-admin/quotas?usage=USED&project=...

Is there any way to use a GPU when the quota is 1?

EDIT:

When the pod has been created with kubectl apply, a kubectl describe pod gke-training-pod-gpu command shows the following event:

Events:
  Type     Reason            Age                From               Message
  ----     ------            ----               ----               -------
  Warning  FailedScheduling  48s (x2 over 48s)  default-scheduler  0/1 nodes are available: 1 Insufficient nvidia.com/gpu.
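
For reference, whether the node advertises the GPU resource at all can be checked directly (a diagnostic sketch that simply greps the node description for the standard nvidia.com/gpu resource name):

  # Show whether the node reports capacity/allocatable for nvidia.com/gpu
  kubectl describe node | grep "nvidia.com/gpu"

If nothing shows up, the node has the physical accelerator attached but is not exposing it to the scheduler.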

Upvotes: 0

Views: 643

Answers (2)

Gari Singh

Reputation: 12053

Looks like you need to install the NVIDIA GPU device drivers on your worker node(s).

Running

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml

should do the trick.
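
Once the installer DaemonSet has rolled out, the node should start advertising nvidia.com/gpu as an allocatable resource and the pending pod should get scheduled. A quick way to check progress (a sketch; it assumes the DaemonSet lands in kube-system with "nvidia" in its name, as in the manifest above):

  # Watch the driver installer DaemonSet come up on the GPU node
  kubectl get daemonsets -n kube-system | grep -i nvidia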

Upvotes: 1

AmericanJohnny

Reputation: 111

The best solution as I see it is to request a quota increase in the IAM & Admin Quotas page.

As for the reason this is happening, I can only imagine that both the node and the pod are requesting GPUs, but only the node is getting it because of the capped quota.
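
The current regional GPU quota can also be inspected from the command line before filing the increase (a sketch; the region and metric name are taken from the error message above):

  # List regional quotas for us-central1, including the NVIDIA_K80_GPUS limit and usage
  gcloud compute regions describe us-central1 --format="yaml(quotas)"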

Upvotes: 1
