Reputation: 594
I'm running a Google Cloud Composer GKE cluster. I have a default node pool of 3 normal CPU nodes and one node pool with a GPU node. The GPU node pool has autoscaling activated.
I want to run a script inside a Docker container on that GPU node.
For the GPU node's operating system I decided to go with cos_containerd instead of Ubuntu.
I've followed https://cloud.google.com/kubernetes-engine/docs/how-to/gpus and ran this line:
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
The GPU now shows up when I run kubectl describe on the GPU node (roughly the check below), but my test script's debug information tells me that the GPU is not being used.
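For reference, this is roughly how I check it (the node name is just a placeholder):
kubectl describe node <my-gpu-node-name> | grep -A 2 nvidia.com/gpu
The nvidia.com/gpu entry appears under the node's Capacity/Allocatable.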
When I connect to the auto-provisioned GPU node via SSH, I can see that I still need to run
cos-extensions install gpu
in order to use the GPU.
I now want my Cloud Composer GKE cluster to run "cos-extensions install gpu" whenever a node is created by the autoscaler feature.
I would like to apply something like this YAML:
#cloud-config
runcmd:
  - cos-extensions install gpu
to my Cloud Composer GKE cluster.
Can I do that with kubectl apply? Ideally I would like to apply that YAML only to the GPU node. How can I achieve that?
I'm new to Kubernetes and I've already spent a lot of time on this without success. Any help would be much appreciated.
Best, Phil
UPDATE: OK, thanks to Harsh I realized I have to go via a DaemonSet + ConfigMap as shown here: https://github.com/GoogleCloudPlatform/solutions-gke-init-daemonsets-tutorial
My GPU node has the label
gpu-type=t4
so I've created and kubectl applied this ConfigMap:
apiVersion: v1
kind: ConfigMap
metadata:
  name: phils-init-script
  labels:
    gpu-type: t4
data:
  entrypoint.sh: |
    #!/usr/bin/env bash
    ROOT_MOUNT_DIR="${ROOT_MOUNT_DIR:-/root}"
    chroot "${ROOT_MOUNT_DIR}" cos-extensions install gpu
and here is my DaemonSet (I also kubectl applied this one):
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: phils-cos-extensions-gpu-installer
  labels:
    gpu-type: t4
spec:
  selector:
    matchLabels:
      gpu-type: t4
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: phils-cos-extensions-gpu-installer
        gpu-type: t4
    spec:
      volumes:
      - name: root-mount
        hostPath:
          path: /
      - name: phils-init-script
        configMap:
          name: phils-init-script
          defaultMode: 0744
      initContainers:
      - image: ubuntu:18.04
        name: phils-cos-extensions-gpu-installer
        command: ["/scripts/entrypoint.sh"]
        env:
        - name: ROOT_MOUNT_DIR
          value: /root
        securityContext:
          privileged: true
        volumeMounts:
        - name: root-mount
          mountPath: /root
        - name: phils-init-script
          mountPath: /scripts
      containers:
      - image: "gcr.io/google-containers/pause:2.0"
        name: pause
but nothing happens; I get the message "Pods are pending".
During the run I connect to the GPU node via SSH and can see that the ConfigMap shell code never got applied (I check the pending pods roughly as sketched below).
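This is roughly how I've been inspecting the pending pods; the pod name is just a placeholder:
kubectl get pods -o wide
kubectl describe pod <pending-installer-pod>
The Events section at the end of the describe output shows why the scheduler can't place the pod (for example an unsatisfied node selector or an untolerated node taint).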
What am I missing here?
I'm desperately trying to make this work.
Best, Phil
Thanks for all your help so far!
Upvotes: 4
Views: 1818
Reputation: 1102
If you've installed the driver so many times and nvidia-smi is still failing to communicate, take a look into prime-select.
Run prime-select query; this way you are going to get all possible options. It must show at least nvidia | intel.
Select nvidia with prime-select nvidia.
Then, if you see nvidia is already selected, choose a different one, e.g. prime-select intel. Next, switch back to nvidia with prime-select nvidia.
Reboot and check nvidia-smi.
Plus, it could be a good idea to run again:
sudo apt install nvidia-cuda-toolkit
When it finishes, reboot the machine, and nvidia-smi should work. The whole sequence is consolidated in the sketch below.
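A rough consolidation of the steps above, assuming an Ubuntu VM where the nvidia-prime utility is available (prime-select needs sudo to switch profiles):
prime-select query                    # list the available profiles, should show at least nvidia | intel
sudo prime-select intel               # switch away first if nvidia is already selected
sudo prime-select nvidia              # then switch back to the nvidia profile
sudo apt install nvidia-cuda-toolkit  # reinstall the CUDA toolkit
sudo reboot
nvidia-smi                            # check again after the reboot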
Now, in other cases it works to follow these instructions to install cuDNN and CUDA on VMs: cuda_11.2_installation_on_Ubuntu_20.04.
And finally, in some other cases the issue is caused by unattended-upgrades. Take a look into its settings and adjust them if it is causing unexpected results. This URL has the documentation for Debian, and I was able to see that you already tested with that distro: UnattendedUpgrades.
Upvotes: 1
Reputation: 30180
Can I do that with kubectl apply? Ideally I would like to apply that YAML only to the GPU node. How can I achieve that?
Yes, you can run a DaemonSet, which will run the command on each node.
Since you are on GKE, the DaemonSet will also run the command or script on new nodes as they get scaled up.
A DaemonSet is mainly for running an application on every available node in the cluster, so we can leverage it to run the command on each node that exists now and on each node that comes up later. To run it only on the GPU nodes, add a nodeSelector to the pod template; see the sketch after the example YAML.
Example YAML:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-initializer
  labels:
    app: default-init
spec:
  selector:
    matchLabels:
      app: default-init
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: node-initializer
        app: default-init
    spec:
      volumes:
      - name: root-mount
        hostPath:
          path: /
      - name: entrypoint
        configMap:
          name: entrypoint
          defaultMode: 0744
      initContainers:
      - image: ubuntu:18.04
        name: node-initializer
        command: ["/scripts/entrypoint.sh"]
        env:
        - name: ROOT_MOUNT_DIR
          value: /root
        securityContext:
          privileged: true
        volumeMounts:
        - name: root-mount
          mountPath: /root
        - name: entrypoint
          mountPath: /scripts
      containers:
      - image: "gcr.io/google-containers/pause:2.0"
        name: pause
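To land the DaemonSet only on the GPU nodes, you can add a nodeSelector (and, if GKE has tainted the GPU nodes with nvidia.com/gpu, a matching toleration) to the pod template. A rough sketch of the extra fields, assuming your GPU nodes really carry a gpu-type=t4 node label as in your case (GKE also sets the cloud.google.com/gke-accelerator label on GPU nodes):
      # extra fields under spec.template.spec of the DaemonSet above (sketch, adjust labels to your nodes)
      nodeSelector:
        gpu-type: t4                  # or: cloud.google.com/gke-accelerator: nvidia-tesla-t4
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
With that in place the init pod is scheduled only on nodes carrying that label, including new ones the autoscaler brings up.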
GitHub link for the example: https://github.com/GoogleCloudPlatform/solutions-gke-init-daemonsets-tutorial
Exact deployment steps: https://cloud.google.com/solutions/automatically-bootstrapping-gke-nodes-with-daemonsets#deploying_the_daemonset
Full article: https://cloud.google.com/solutions/automatically-bootstrapping-gke-nodes-with-daemonsets
Upvotes: 3