Felipe Serna

Running a docker container which uses GPU from kubernetes fails to find the GPU

I want to run a Docker container that uses the GPU (it runs a CNN to detect objects in a video), and then run that container on Kubernetes.

I can run the container with Docker alone without problems, but when I run it from Kubernetes it fails to find the GPU.

I run it using this command:

kubectl exec -it namepod /bin/bash

This is the problem that I get:

kubectl exec -it tym-python-5bb7fcf76b-4c9z6 /bin/bash
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
root@tym-python-5bb7fcf76b-4c9z6:/opt# cd servicio/
root@tym-python-5bb7fcf76b-4c9z6:/opt/servicio# python3 TM_Servicev2.py 
 Try to load cfg: /opt/darknet/cfg/yolov4.cfg, weights: /opt/yolov4.weights, clear = 0 
CUDA status Error: file: ./src/dark_cuda.c : () : line: 620 : build time: Jul 30 2021 - 14:05:34 

 CUDA Error: no CUDA-capable device is detected
python3: check_error: Unknown error -1979678822

EDIT: I followed all the steps in the NVIDIA Docker 2 guide and downloaded the NVIDIA plugin for Kubernetes.

However, when I deploy the pod on Kubernetes it stays in the "Pending" state and never starts. I no longer get an error, but the pod never runs. It appears like this:

gpu-pod                       0/1     Pending   0          3m19s


I ended up reinstalling everything, and now my pod shows as Completed rather than Running, like this:

default       gpu-operator-test                          0/1     Completed   0             62m

Answering Wiktor: when I run this command:

kubectl describe pod gpu-operator-test 

I get:

Name:         gpu-operator-test
Namespace:    default
Priority:     0
Node:         pdi-mc/
Start Time:   Mon, 09 Aug 2021 12:09:51 -0500
Labels:       <none>
Annotations:  cni.projectcalico.org/containerID: 968e49d27fb3d86ed7e70769953279271b675177e188d52d45d7c4926bcdfbb2
Status:       Succeeded
    Container ID:   docker://d49545fad730b2ec3ea81a45a85a2fef323edc82e29339cd3603f122abde9cef
    Image:          nvidia/samples:vectoradd-cuda10.2
    Image ID:       docker-pullable://nvidia/samples@sha256:4593078cdb8e786d35566faa2b84da1123acea42f0d4099e84e2af0448724af1
    Port:           <none>
    Host Port:      <none>
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Mon, 09 Aug 2021 12:10:29 -0500
      Finished:     Mon, 09 Aug 2021 12:10:30 -0500
    Ready:          False
    Restart Count:  0
      nvidia.com/gpu:  1
      nvidia.com/gpu:  1
    Environment:       <none>
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-9ktgq (ro)
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:                      <none>

I'm using this configuration file to create the pod:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-operator-test
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vector-add
    image: "nvidia/samples:vectoradd-cuda10.2"
    resources:
      limits:
        nvidia.com/gpu: 1

Answers (1)

Wytrzymały Wiktor

Addressing two topics here:

  1. The error you saw at the beginning:

kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.

This means that you used the deprecated form of the kubectl exec command, which omits the -- separator before the command. The proper syntax is:

$ kubectl exec (POD | TYPE/NAME) [-c CONTAINER] [flags] -- COMMAND [args...]

See here for more details.
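
Applied to the pod from the question, the corrected invocation could look like the sketch below. The wrapper name pod_shell is a hypothetical helper introduced only for illustration; the pod name in the usage comment is the one shown in the question's output.

```shell
# pod_shell is a hypothetical helper name, not part of kubectl.
# It opens an interactive bash shell in the given pod using the
# non-deprecated syntax.
pod_shell() {
  # Everything after "--" is passed to the container as the command,
  # so kubectl no longer has to guess where its own flags end.
  kubectl exec -it "$1" -- /bin/bash
}

# Usage (pod name taken from the question):
# pod_shell tym-python-5bb7fcf76b-4c9z6
```

This only removes the deprecation warning; it does not by itself change whether the container can see the GPU.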

  2. According to the official docs, the gpu-operator-test pod is expected to run to completion rather than keep running.

In your kubectl describe output, the pod's status is Succeeded, and the container state shows:

 State:          Terminated
   Reason:       Completed
   Exit Code:    0

Exit Code: 0 means that the specified container command completed successfully.

More details can be found in the official docs.
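
If you prefer to check the exit code from a script instead of reading the describe output, something along these lines should work. This is a minimal sketch: pod_exit_code is a hypothetical helper name, and it assumes the pod has a single container that has already terminated.

```shell
# pod_exit_code is a hypothetical helper name, not part of kubectl.
# It prints the exit code of the first container of a terminated pod.
pod_exit_code() {
  kubectl get pod "$1" \
    -o jsonpath='{.status.containerStatuses[0].state.terminated.exitCode}'
}

# Usage: report success only if the container exited with code 0.
# [ "$(pod_exit_code gpu-operator-test)" = "0" ] && echo "test pod succeeded"
```

The helper prints nothing if the container is still running, because the terminated state is not set yet.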

