Reputation: 56
I'm using GPU instances in a GKE environment on GCP. Unfortunately, I've been losing internet access on the GPU instances: sometimes it works, sometimes it doesn't. I first noticed the issue when a request to download a pretrained model from Hugging Face couldn't resolve the host.
I've been playing around to better understand the problem, and this "internet loss" only happens on GPU instances.
I've been using the following Kubernetes container to start troubleshooting the problem:
containers:
- name: test-img
  image: test:0
  command: [ "/bin/bash", "-c", "--" ]
  args: [ "while true; do sleep 30; done;" ]
(test being a plain Ubuntu/Debian base image or similar). Afterwards I exec into the pod and try to install any package with apt-get, for example apt-get install -y wget, and I get the following output:
Err:1 http://deb.debian.org/debian bullseye/main amd64 libpsl5 amd64 0.21.0-1.2
Temporary failure resolving 'deb.debian.org'
Err:2 http://deb.debian.org/debian bullseye/main amd64 wget amd64 1.21-1+b1
Temporary failure resolving 'deb.debian.org'
Err:3 http://deb.debian.org/debian bullseye/main amd64 publicsuffix all 20210108.1309-1
Temporary failure resolving 'deb.debian.org'
E: Failed to fetch http://deb.debian.org/debian/pool/main/libp/libpsl/libpsl5_0.21.0-1.2_amd64.deb Temporary failure resolving 'deb.debian.org'
E: Failed to fetch http://deb.debian.org/debian/pool/main/w/wget/wget_1.21-1%2bb1_amd64.deb Temporary failure resolving 'deb.debian.org'
E: Failed to fetch http://deb.debian.org/debian/pool/main/p/publicsuffix/publicsuffix_20210108.1309-1_all.deb Temporary failure resolving 'deb.debian.org'
Note that as soon as I retry this on a non-GPU instance, it works perfectly: same image, same YAML file, etc.
I feel like it's something with DNS resolution. What's surprising is that this bug first appeared a few weeks ago and disappeared by itself. It's back now, and I haven't changed anything in the config so far.
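For context, roughly the kind of checks that point at DNS rather than general egress, runnable from inside the failing pod without installing anything (8.8.8.8 is just an arbitrary public IP for the raw-connectivity check; exact output will vary):

# which nameserver the pod is pointed at (normally the cluster's kube-dns ClusterIP)
cat /etc/resolv.conf

# name resolution through that nameserver
getent hosts deb.debian.org

# raw TCP egress to a public IP, no DNS involved (uses bash's built-in /dev/tcp)
timeout 3 bash -c '</dev/tcp/8.8.8.8/53' && echo "raw egress OK"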
Note that the workload is of kind Job. Any help is welcome; I've been struggling with this for quite a while.
Upvotes: 0
Views: 75
Reputation: 63
After some research, I found out that GKE automatically taints GPU nodes, so system pods like kube-dns cannot run on them. Therefore, you could either add a non-GPU node to the cluster or remove the taint.
For more information: https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#create
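A sketch of how to check this and, if acceptable, remove the taint (NODE_NAME is a placeholder; the taint shown is the one GKE applies to GPU node pools, so verify it against the get nodes output first):

# see which taints each node carries
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints

# check whether any kube-dns replica is running at all, and on which node
kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide

# remove the GPU taint from a node (the trailing "-" removes it)
kubectl taint nodes NODE_NAME nvidia.com/gpu=present:NoSchedule-

Bear in mind that removing the taint also lets ordinary workloads schedule onto the (expensive) GPU nodes, and GKE may reapply it when the node is recreated, so keeping a small CPU node in the cluster is usually the more durable option.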
Upvotes: 0
Reputation: 56
I've managed to solve the issue, even though the solution is rather hacky. This bug only happens when the cluster has GPU instances exclusively. Adding a single small CPU instance (an e2-micro) to the cluster solves it.
Could it be that the GPU instances are on a specific private network and only talk to the outside 'through' the CPU instances in the cluster?
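For reference, adding the extra CPU node pool is a one-liner; the cluster name, zone and pool name below are placeholders, and e2-micro can be swapped for any small machine type:

gcloud container node-pools create cpu-pool \
    --cluster=my-cluster \
    --zone=us-central1-a \
    --machine-type=e2-micro \
    --num-nodes=1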
Upvotes: 1