Reputation: 1
We followed this guide to use GPU-enabled nodes in our existing cluster, but when we try to schedule Pods we get a "2 Insufficient nvidia.com/gpu" error.
Details:
We are trying to use GPUs in our existing cluster, and as a first step we were able to successfully create a NodePool with a single GPU-enabled node.
As the next step according to the guide above, we have to create a DaemonSet, and we were able to run the DS successfully as well.
But now, when we try to schedule a Pod using the following resources section, the Pod becomes unschedulable with the error "2 Insufficient nvidia.com/gpu":
resources:
  limits:
    nvidia.com/gpu: "1"
  requests:
    cpu: 200m
    memory: 3Gi
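For completeness, a minimal sketch of the Pod spec we are scheduling (name and image are placeholders), combining the resources above with the GPU toleration listed further below:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-test                  # placeholder name
spec:
  containers:
  - name: app
    image: our-gpu-app:latest     # placeholder image
    resources:
      limits:
        nvidia.com/gpu: "1"
      requests:
        cpu: 200m
        memory: 3Gi
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule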
Specs:
Node version - v1.18.17-gke.700 (also tried v1.17.17-gke.6000)
Instance type - n1-standard-4
Image - cos
GPU - NVIDIA Tesla T4
Any help or pointers to debug this further will be highly appreciated.
TIA,
Output of kubectl get node <gpu-node> -o yaml:
[Redacted]
apiVersion: v1
kind: Node
metadata:
  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/instance-type: n1-standard-4
    beta.kubernetes.io/os: linux
    cloud.google.com/gke-accelerator: nvidia-tesla-t4
    cloud.google.com/gke-boot-disk: pd-standard
    cloud.google.com/gke-container-runtime: docker
    cloud.google.com/gke-nodepool: gpu-node
    cloud.google.com/gke-os-distribution: cos
    cloud.google.com/machine-family: n1
    failure-domain.beta.kubernetes.io/region: us-central1
    failure-domain.beta.kubernetes.io/zone: us-central1-b
    kubernetes.io/arch: amd64
    kubernetes.io/os: linux
    node.kubernetes.io/instance-type: n1-standard-4
    topology.kubernetes.io/region: us-central1
    topology.kubernetes.io/zone: us-central1-b
  name: gke-gpu-node-d6ddf1f6-0d7j
spec:
  taints:
  - effect: NoSchedule
    key: nvidia.com/gpu
    value: present
status:
  ...
  allocatable:
    attachable-volumes-gce-pd: "127"
    cpu: 3920m
    ephemeral-storage: "133948343114"
    hugepages-2Mi: "0"
    memory: 12670032Ki
    pods: "110"
  capacity:
    attachable-volumes-gce-pd: "127"
    cpu: "4"
    ephemeral-storage: 253696108Ki
    hugepages-2Mi: "0"
    memory: 15369296Ki
    pods: "110"
  conditions:
    ...
  nodeInfo:
    architecture: amd64
    containerRuntimeVersion: docker://19.3.14
    kernelVersion: 5.4.89+
    kubeProxyVersion: v1.18.17-gke.700
    kubeletVersion: v1.18.17-gke.700
    operatingSystem: linux
    osImage: Container-Optimized OS from Google
Tolerations from the Deployment:
tolerations:
- effect: NoSchedule
  key: nvidia.com/gpu
  operator: Exists
- effect: NoExecute
  key: node.kubernetes.io/not-ready
  operator: Exists
  tolerationSeconds: 300
- effect: NoExecute
  key: node.kubernetes.io/unreachable
  operator: Exists
  tolerationSeconds: 300
Upvotes: 7
Views: 5673
Reputation: 21
Using the nvidia-gpu-device-plugin is the first thing you should try, but there are some more requirements that need to be fulfilled and ensured.
Enable the DevicePlugins feature gate for the kubelet in /etc/systemd/system/kubelet.service.d/10-kubeadm.conf:
Environment="KUBELET_EXTRA_ARGS=--feature-gates=DevicePlugins=true"
Configure the NVIDIA container runtime as Docker's default runtime in the Docker daemon configuration (typically /etc/docker/daemon.json):
{
  "exec-opts": ["native.cgroupdriver=systemd"],
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
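After changing these files, reload systemd and restart Docker and the kubelet so the new settings take effect (assuming a systemd-based host):

sudo systemctl daemon-reload
sudo systemctl restart docker
sudo systemctl restart kubelet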
Last but not least, if you are still facing the same issue, make sure you are using an official base image released by NVIDIA.
When I tried a custom install of PyTorch on Ubuntu with CUDA in the Docker build, the image built successfully but the application was not able to detect CUDA, so I would suggest going with the official images built by NVIDIA.
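For example, a quick smoke test is a Pod built from an official NVIDIA CUDA base image that just runs nvidia-smi (the image tag below is an assumption; substitute one that exists and matches your driver version, and note that nvidia-smi is injected by the NVIDIA container runtime configured above):

apiVersion: v1
kind: Pod
metadata:
  name: cuda-smoke-test                          # hypothetical name
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:11.0.3-base-ubuntu20.04   # assumption: pick an official NVIDIA tag matching your driver
    command: ["nvidia-smi"]                      # provided via the NVIDIA container runtime
    resources:
      limits:
        nvidia.com/gpu: "1"
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule

If nvidia-smi lists the GPU here but your own image does not detect CUDA, the problem is in the image rather than the cluster.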
Upvotes: 1
Reputation: 3791
To complete @hilesenrat's answer, which was not completely adapted to my case but allowed me to find the solution:
In my case the plugin DaemonSet was already installed, but its pods would not start because of a volume error.
kubectl get pods -n kube-system | grep -i nvidia
nvidia-gpu-device-plugin-cbk9m 0/1 ContainerCreating 0 22h
nvidia-gpu-device-plugin-gt5vf 0/1 ContainerCreating 0 22h
nvidia-gpu-device-plugin-mgrr5 0/1 ContainerCreating 0 22h
nvidia-gpu-device-plugin-vt474 0/1 ContainerCreating 0 22h
kubectl describe pods nvidia-gpu-device-plugin-cbk9m -n kube-system
...
Warning FailedMount 5m1s (x677 over 22h) kubelet MountVolume.SetUp failed for volume "nvidia" : hostPath type check failed: /home/kubernetes/bin/nvidia is not a directory
In fact, according to the Google Cloud documentation, the NVIDIA device drivers need to be installed on these nodes. Once they are installed, the situation is unblocked.
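For reference, on COS nodes the drivers are installed by applying a DaemonSet from Google's container-engine-accelerators repository; the command below reflects the GKE documentation at the time, so verify the URL against the current docs:

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml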
kubectl get pods -n kube-system | grep -i nvidia
nvidia-driver-installer-6mxr2 1/1 Running 0 2m33s
nvidia-driver-installer-8lww7 1/1 Running 0 2m33s
nvidia-driver-installer-m748p 1/1 Running 0 2m33s
nvidia-driver-installer-r4x8c 1/1 Running 0 2m33s
nvidia-gpu-device-plugin-cbk9m 1/1 Running 0 22h
nvidia-gpu-device-plugin-gt5vf 1/1 Running 0 22h
nvidia-gpu-device-plugin-mgrr5 1/1 Running 0 22h
nvidia-gpu-device-plugin-vt474 1/1 Running 0 22h
Upvotes: 3
Reputation: 1420
The nvidia-gpu-device-plugin should be installed on the GPU node as well. You should see a nvidia-gpu-device-plugin DaemonSet in your kube-system namespace.
It should be automatically deployed by Google, but if you want to deploy it on your own, run the following command:
kubectl apply -f https://raw.githubusercontent.com/kubernetes/kubernetes/master/cluster/addons/device-plugins/nvidia-gpu/daemonset.yaml
It will install the GPU plugin on the node, and afterwards your pods will be able to consume it.
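Once the plugin (and the GPU drivers) are running, the node should advertise the GPU resource; a quick way to check is, for example:

kubectl describe node gke-gpu-node-d6ddf1f6-0d7j | grep -i "nvidia.com/gpu"

nvidia.com/gpu should now appear under both Capacity and Allocatable; only then can a Pod requesting nvidia.com/gpu: "1" be scheduled on that node.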
Upvotes: 6