ThinkTeamwork

Reputation: 594

Install GPU Driver on autoscaling Node in GKE (Cloud Composer)

I'm running a Google Cloud Composer GKE cluster. I have a default node pool of 3 regular CPU nodes and one node pool with a GPU node. The GPU node pool has autoscaling enabled.

I want to run a script inside a docker container on that GPU node.

For the GPU node's operating system I decided to go with cos_containerd instead of Ubuntu.

I've followed https://cloud.google.com/kubernetes-engine/docs/how-to/gpus and ran this line:

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml

The GPU now shows up when I run kubectl describe on the GPU node; however, my test script's debug output tells me that the GPU is not being used.
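
(For reference, I check this with something like the following; the node name is just a placeholder:

kubectl describe node <my-gpu-node> | grep nvidia.com/gpu

which lists nvidia.com/gpu under Capacity and Allocatable.)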

When I connect to the autoprovisioned GPU node via SSH, I can see that I still need to run

cos-extensions install gpu

in order to use the GPU.
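
(If I read the COS docs correctly, that installer puts the driver under /var/lib/nvidia on the node, and it can be verified there with something like the following; the remount is needed because that path is mounted noexec by default:

sudo mount --bind /var/lib/nvidia /var/lib/nvidia
sudo mount -o remount,exec /var/lib/nvidia
/var/lib/nvidia/bin/nvidia-smi
)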

I now want my Cloud Composer GKE cluster to run cos-extensions install gpu whenever a node is created by the autoscaler.

I would like to apply something like this yaml:

#cloud-config

runcmd:
  - cos-extensions install gpu

to my cloud composer GKE cluster.

Can I do that with kubectl apply? Ideally I would like to run that YAML only on the GPU node. How can I achieve that?

I'm new to Kubernetes and I've already spent a lot of time on this without success. Any help would be much appreciated.

Best, Phil

UPDATE: OK, thanks to Harsh I realized I have to go with a DaemonSet + ConfigMap, as in https://github.com/GoogleCloudPlatform/solutions-gke-init-daemonsets-tutorial

My GPU node has the label

gpu-type=t4
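
(which I can confirm with something like

kubectl get nodes -l gpu-type=t4

listing only the GPU node)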

so I've created and kubectl applied this ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: phils-init-script
  labels:
    gpu-type: t4
data:
  entrypoint.sh: |
    #!/usr/bin/env bash

    ROOT_MOUNT_DIR="${ROOT_MOUNT_DIR:-/root}"

    chroot "${ROOT_MOUNT_DIR}" cos-extensions gpu install

and here is my DaemonSet (I also kubectl applied this one):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: phils-cos-extensions-gpu-installer
  labels:
    gpu-type: t4
spec:
  selector:
    matchLabels:
      gpu-type: t4
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: phils-cos-extensions-gpu-installer
        gpu-type: t4
    spec:
      volumes:
      - name: root-mount
        hostPath:
          path: /
      - name: phils-init-script
        configMap:
          name: phils-init-script
          defaultMode: 0744
      initContainers:
      - image: ubuntu:18.04
        name: phils-cos-extensions-gpu-installer
        command: ["/scripts/entrypoint.sh"]
        env:
        - name: ROOT_MOUNT_DIR
          value: /root
        securityContext:
          privileged: true
        volumeMounts:
        - name: root-mount
          mountPath: /root
        - name: phils-init-script
          mountPath: /scripts
      containers:
      - image: "gcr.io/google-containers/pause:2.0"
        name: pause

but nothing happens; I get the message "Pods are pending".
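
(I assume the first step is to look at why the pods are pending, e.g. with something like the following; the pod name is whatever shows up as Pending:

kubectl get pods -o wide -l name=phils-cos-extensions-gpu-installer
kubectl describe pod <pending-pod-name>
)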

While this is running, I connect to the GPU node via SSH and can see that the ConfigMap's shell script never ran.

What am I missing here?

I'm desperately trying to make this work.

Best, Phil

Thanks for all your help so far!

Upvotes: 4

Views: 1818

Answers (2)

If you've installed the driver several times and nvidia-smi still fails to communicate with the GPU, take a look at prime-select (the full command sequence is sketched after the list):

  1. Run prime-select query to list all available options; it must show at least nvidia | intel.

  2. Run prime-select nvidia to select the NVIDIA driver.

  3. If nvidia is already selected, switch to a different one first, e.g. prime-select intel, and then switch back with prime-select nvidia.

  4. Reboot and check nvidia-smi.
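
Putting it together, the sequence looks roughly like this (a sketch; run it on the affected machine):

prime-select query
sudo prime-select intel
sudo prime-select nvidia
sudo reboot

and after the reboot:

nvidia-smi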

Plus, it could be a good idea to run again:

sudo apt install nvidia-cuda-toolkit

When it finishes, reboot the machine, and nvidia-smi should work then.

In other cases it helps to follow these instructions to install cuDNN and CUDA on VMs: cuda_11.2_installation_on_Ubuntu_20.04.

And finally, in some other cases the issue is caused by unattended-upgrades. Take a look at its settings and adjust them if it is causing unexpected results. The Debian documentation is here: UnattendedUpgrades (I could see that you already tested with that distro).

Upvotes: 1

Harsh Manvar

Reputation: 30180

Can I do that with kubectl apply? Ideally I would like to run that YAML only on the GPU node. How can I achieve that?

Yes, you can run a DaemonSet, which will run the command on every node.

Since you are on GKE, the DaemonSet will also run the command or script on new nodes as they get scaled up.

A DaemonSet is mainly for running an application or workload on every available node in the cluster.

We can leverage this DaemonSet to run the command on every node that exists now and on every node that comes up later.

Example YAML :

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-initializer
  labels:
    app: default-init
spec:
  selector:
    matchLabels:
      app: default-init
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: node-initializer
        app: default-init
    spec:
      volumes:
      - name: root-mount
        hostPath:
          path: /
      - name: entrypoint
        configMap:
          name: entrypoint
          defaultMode: 0744
      initContainers:
      - image: ubuntu:18.04
        name: node-initializer
        command: ["/scripts/entrypoint.sh"]
        env:
        - name: ROOT_MOUNT_DIR
          value: /root
        securityContext:
          privileged: true
        volumeMounts:
        - name: root-mount
          mountPath: /root
        - name: entrypoint
          mountPath: /scripts
      containers:
      - image: "gcr.io/google-containers/pause:2.0"
        name: pause
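
The DaemonSet above expects a ConfigMap named entrypoint that provides /scripts/entrypoint.sh. The actual script is in the linked tutorial; a minimal sketch adapted to your goal (the chroot command is taken from your question) could look like this:

apiVersion: v1
kind: ConfigMap
metadata:
  name: entrypoint
  labels:
    app: default-init
data:
  entrypoint.sh: |
    #!/usr/bin/env bash
    set -euo pipefail

    # The node's root filesystem is mounted at /root in the init container.
    ROOT_MOUNT_DIR="${ROOT_MOUNT_DIR:-/root}"

    # Run the installer in the node's context, not the container's.
    chroot "${ROOT_MOUNT_DIR}" cos-extensions install gpu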

Github link for example : https://github.com/GoogleCloudPlatform/solutions-gke-init-daemonsets-tutorial

Exact deployment step : https://cloud.google.com/solutions/automatically-bootstrapping-gke-nodes-with-daemonsets#deploying_the_daemonset

Full article : https://cloud.google.com/solutions/automatically-bootstrapping-gke-nodes-with-daemonsets
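
Also, since you want to run this only on the GPU node pool: you can restrict the DaemonSet with a nodeSelector on the pod spec using the node label from your question, and GKE GPU node pools usually carry an nvidia.com/gpu taint, so a matching toleration may be needed as well. A sketch (not part of the tutorial YAML) of the extra fields under the pod template's spec:

      nodeSelector:
        gpu-type: t4
      tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"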

Upvotes: 3
