Reputation: 1
We followed this guide to use GPU-enabled nodes in our existing cluster, but when we try to schedule Pods we get a "2 Insufficient nvidia.com/gpu" error.
Details:
We are trying to use GPUs in our existing cluster, and as a first step we were able to successfully create a NodePool with a single GPU-enabled node.
As the next step according to the guide above, we have to create a DaemonSet, and we were able to run the DS successfully as well.
But now, when we try to schedule a Pod using the following resources section, the Pod becomes unschedulable with the error "2 Insufficient nvidia.com/gpu":
resources:
  limits:
    nvidia.com/gpu: "1"
  requests:
    cpu: 200m
    memory: 3Gi
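For completeness, a minimal sketch of the Pod spec we are scheduling (name and image are placeholders), combining the resources above with the GPU toleration listed further below:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-test                  # placeholder name
spec:
  containers:
  - name: app
    image: our-gpu-app:latest     # placeholder image
    resources:
      limits:
        nvidia.com/gpu: "1"
      requests:
        cpu: 200m
        memory: 3Gi
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule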
Specs:
Node version - v1.18.17-gke.700 (also tried v1.17.17-gke.6000)
Instance type - n1-standard-4
Image - cos
GPU - NVIDIA Tesla T4
Any help or pointers to debug this further will be highly appreciated.
TIA,
Output of kubectl get node <gpu-node> -o yaml:
[Redacted]
apiVersion: v1
kind: Node
metadata:
  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/instance-type: n1-standard-4
    beta.kubernetes.io/os: linux
    cloud.google.com/gke-accelerator: nvidia-tesla-t4
    cloud.google.com/gke-boot-disk: pd-standard
    cloud.google.com/gke-container-runtime: docker
    cloud.google.com/gke-nodepool: gpu-node
    cloud.google.com/gke-os-distribution: cos
    cloud.google.com/machine-family: n1
    failure-domain.beta.kubernetes.io/region: us-central1
    failure-domain.beta.kubernetes.io/zone: us-central1-b
    kubernetes.io/arch: amd64
    kubernetes.io/os: linux
    node.kubernetes.io/instance-type: n1-standard-4
    topology.kubernetes.io/region: us-central1
    topology.kubernetes.io/zone: us-central1-b
  name: gke-gpu-node-d6ddf1f6-0d7j
spec:
  taints:
  - effect: NoSchedule
    key: nvidia.com/gpu
    value: present
status:
  ...
  allocatable:
    attachable-volumes-gce-pd: "127"
    cpu: 3920m
    ephemeral-storage: "133948343114"
    hugepages-2Mi: "0"
    memory: 12670032Ki
    pods: "110"
  capacity:
    attachable-volumes-gce-pd: "127"
    cpu: "4"
    ephemeral-storage: 253696108Ki
    hugepages-2Mi: "0"
    memory: 15369296Ki
    pods: "110"
  conditions:
    ...
  nodeInfo:
    architecture: amd64
    containerRuntimeVersion: docker://19.3.14
    kernelVersion: 5.4.89+
    kubeProxyVersion: v1.18.17-gke.700
    kubeletVersion: v1.18.17-gke.700
    operatingSystem: linux
    osImage: Container-Optimized OS from Google
Tolerations from the Deployment:
tolerations:
- effect: NoSchedule
  key: nvidia.com/gpu
  operator: Exists
- effect: NoExecute
  key: node.kubernetes.io/not-ready
  operator: Exists
  tolerationSeconds: 300
- effect: NoExecute
  key: node.kubernetes.io/unreachable
  operator: Exists
  tolerationSeconds: 300
Upvotes: 7
Views: 5673
Reputation: 21
Using the nvidia-gpu-device-plugin is the first thing you should try, but there are some more requirements that need to be fulfilled and ensured.
Enable the DevicePlugins feature gate for the kubelet in /etc/systemd/system/kubelet.service.d/10-kubeadm.conf:
Environment="KUBELET_EXTRA_ARGS=--feature-gates=DevicePlugins=true"
Configure the NVIDIA container runtime as Docker's default runtime in the Docker daemon configuration (typically /etc/docker/daemon.json):
{
  "exec-opts": ["native.cgroupdriver=systemd"],
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
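After changing these files, reload systemd and restart Docker and the kubelet so the new settings take effect (assuming a systemd-based host):

sudo systemctl daemon-reload
sudo systemctl restart docker
sudo systemctl restart kubelet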
Last but not least, if you are still facing the same issue, make sure you are using an official base image released by NVIDIA.
When I tried a custom install of PyTorch on Ubuntu with CUDA in the Docker build, the image built successfully but the application was not able to detect CUDA, so I would suggest going with the official images built by NVIDIA.
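For example, a quick smoke test is a Pod built from an official NVIDIA CUDA base image that just runs nvidia-smi (the image tag below is an assumption; substitute one that exists and matches your driver version, and note that nvidia-smi is injected by the NVIDIA container runtime configured above):

apiVersion: v1
kind: Pod
metadata:
  name: cuda-smoke-test                          # hypothetical name
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:11.0.3-base-ubuntu20.04   # assumption: pick an official NVIDIA tag matching your driver
    command: ["nvidia-smi"]                      # provided via the NVIDIA container runtime
    resources:
      limits:
        nvidia.com/gpu: "1"
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule

If nvidia-smi lists the GPU here but your own image does not detect CUDA, the problem is in the image rather than the cluster.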
Upvotes: 1
Reputation: 3791
To complete @hilesenrat's answer, which was not completely adapted to my case but allowed me to find the solution:
In my case the plugin DaemonSet was already installed, but its pods would not start because of a volume error.
kubectl get pods -n kube-system | grep -i nvidia
nvidia-gpu-device-plugin-cbk9m 0/1 ContainerCreating 0 22h
nvidia-gpu-device-plugin-gt5vf 0/1 ContainerCreating 0 22h
nvidia-gpu-device-plugin-mgrr5 0/1 ContainerCreating 0 22h
nvidia-gpu-device-plugin-vt474 0/1 ContainerCreating 0 22h
kubectl describe pods nvidia-gpu-device-plugin-cbk9m -n kube-system
...
Warning FailedMount 5m1s (x677 over 22h) kubelet MountVolume.SetUp failed for volume "nvidia" : hostPath type check failed: /home/kubernetes/bin/nvidia is not a directory
In fact, according to the Google Cloud documentation, the NVIDIA device drivers need to be installed on these nodes. Once they are installed, the situation is unblocked.
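For reference, on COS nodes the drivers are installed by applying a DaemonSet from Google's container-engine-accelerators repository; the command below reflects the GKE documentation at the time, so verify the URL against the current docs:

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml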
kubectl get pods -n kube-system | grep -i nvidia
nvidia-driver-installer-6mxr2 1/1 Running 0 2m33s
nvidia-driver-installer-8lww7 1/1 Running 0 2m33s
nvidia-driver-installer-m748p 1/1 Running 0 2m33s
nvidia-driver-installer-r4x8c 1/1 Running 0 2m33s
nvidia-gpu-device-plugin-cbk9m 1/1 Running 0 22h
nvidia-gpu-device-plugin-gt5vf 1/1 Running 0 22h
nvidia-gpu-device-plugin-mgrr5 1/1 Running 0 22h
nvidia-gpu-device-plugin-vt474 1/1 Running 0 22h
Upvotes: 3
Reputation: 1420
The nvidia-gpu-device-plugin should be installed on the GPU node as well. You should see a nvidia-gpu-device-plugin DaemonSet in your kube-system namespace.
It should be automatically deployed by Google, but if you want to deploy it on your own, run the following command:
kubectl apply -f https://raw.githubusercontent.com/kubernetes/kubernetes/master/cluster/addons/device-plugins/nvidia-gpu/daemonset.yaml
It will install the GPU plugin on the node, and afterwards your pods will be able to consume it.
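Once the plugin (and the GPU drivers) are running, the node should advertise the GPU resource; a quick way to check is, for example:

kubectl describe node gke-gpu-node-d6ddf1f6-0d7j | grep -i "nvidia.com/gpu"

nvidia.com/gpu should now appear under both Capacity and Allocatable; only then can a Pod requesting nvidia.com/gpu: "1" be scheduled on that node.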
Upvotes: 6