tt_c

Reputation: 56

GKE GPU Instances lose internet access (only when there's a GPU)

I'm using GPU instances in a GKE environment on GCP. Unfortunately, I've been losing internet access on GPU instances. Sometimes it works, sometimes it doesn't. I first detected the issue as the request I was doing to download the pretrained model from huggingface couldn't resolve.

I've been playing around to better grasp the problem and this "internet loss" only happens in GPU instances.

I've been using the following Kubernetes container to start troubleshooting the problem:

      containers:
      - name: test-img
        image: test:0
        command: [ "/bin/bash", "-c", "--" ]
        args: [ "while true; do sleep 30; done;"]

(`test` being a simple Ubuntu image or similar.) Afterwards I exec into the pod and try to install any package with apt-get, for example `apt-get install -y wget`, and I get the following output:

Err:1 http://deb.debian.org/debian bullseye/main amd64 libpsl5 amd64 0.21.0-1.2
  Temporary failure resolving 'deb.debian.org'
Err:2 http://deb.debian.org/debian bullseye/main amd64 wget amd64 1.21-1+b1
  Temporary failure resolving 'deb.debian.org'
Err:3 http://deb.debian.org/debian bullseye/main amd64 publicsuffix all 20210108.1309-1
  Temporary failure resolving 'deb.debian.org'
E: Failed to fetch http://deb.debian.org/debian/pool/main/libp/libpsl/libpsl5_0.21.0-1.2_amd64.deb  Temporary failure resolving 'deb.debian.org'
E: Failed to fetch http://deb.debian.org/debian/pool/main/w/wget/wget_1.21-1%2bb1_amd64.deb  Temporary failure resolving 'deb.debian.org'
E: Failed to fetch http://deb.debian.org/debian/pool/main/p/publicsuffix/publicsuffix_20210108.1309-1_all.deb  Temporary failure resolving 'deb.debian.org'
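A few checks like these can help confirm whether the failure is DNS resolution rather than general connectivity (the pod name `test-pod` here is just a placeholder for whatever your Job's pod is called):

```shell
# From inside the affected pod: inspect the DNS config and try a lookup.
# /etc/resolv.conf should point at the kube-dns ClusterIP (10.x.x.10 typically).
kubectl exec -it test-pod -- cat /etc/resolv.conf
kubectl exec -it test-pod -- nslookup deb.debian.org

# From outside the pod: check whether kube-dns actually has running
# endpoints, and which nodes those pods landed on.
kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide
kubectl get endpoints kube-dns -n kube-system
```

If the lookup times out and `kube-dns` shows no ready endpoints (or only pods stuck in Pending), the problem is DNS scheduling rather than the network itself.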

Note that as soon as I retry this on a non-GPU instance, it works perfectly. Same image, same YAML file, etc.

I suspect it's something to do with DNS resolution. What's surprising is that this bug first appeared a few weeks ago and disappeared by itself. It's back now, and I haven't changed anything in the config so far.

Note that the pod is of kind Job. Any help is welcome; I've been struggling with this for quite a while.

Upvotes: 0

Views: 75

Answers (2)

Andy Yao

Reputation: 63

After some research, I found out that GKE automatically taints GPU nodes, so system pods like kube-dns cannot run on them. Therefore, you could either add a non-GPU node to the cluster or remove the taint.

For more information: https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#create
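According to the GKE documentation linked above, the taint in question is `nvidia.com/gpu=present:NoSchedule`. A sketch of inspecting and removing it (the node name is a placeholder; note that GKE manages its node pools, so a manually removed taint may be reapplied):

```shell
# Inspect the taints GKE placed on a GPU node (node name is a placeholder).
kubectl describe node gke-gpu-node-1 | grep -A2 Taints

# The trailing '-' removes the taint, allowing system pods such as
# kube-dns to schedule onto the GPU node.
kubectl taint nodes gke-gpu-node-1 nvidia.com/gpu=present:NoSchedule-
```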

Upvotes: 0

tt_c

Reputation: 56

I've managed to solve the issue, even though it's more of a workaround than a fix. The bug only happens when the cluster has GPU instances exclusively. Adding a single CPU instance (an e1 micro) to the cluster solves the issue.

Could it be that GPU instances are on a specific VPN and only talk to the outside "through" the cluster's CPU instances' network?
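For reference, something like this adds a small CPU-only node pool so the cluster is no longer GPU-only (the cluster name, zone, and machine type below are placeholders; pick whatever small CPU machine type suits you):

```shell
# Add a minimal CPU-only node pool to an existing GKE cluster.
# kube-dns and other system pods can then schedule onto this node.
gcloud container node-pools create cpu-pool \
    --cluster my-cluster \
    --zone us-central1-a \
    --machine-type e2-micro \
    --num-nodes 1
```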

Upvotes: 1
