Chao

Reputation: 905

One node in a GKE cluster cannot pull images from Docker Hub

This is a very weird thing.

I created a private GKE cluster with a node pool of 3 nodes. Then I have a replica set with 3 pods. Some of these pods get scheduled to one particular node.

One of these pods always gets ImagePullBackOff. I checked the error:

Failed to pull image "bitnami/mongodb:3.6": rpc error: code = Unknown desc = Error response from daemon: Get https://registry-1.docker.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

And the pods scheduled to the remaining two nodes work well.

I SSH'd into that node and ran docker pull, and everything was fine. I cannot find another way to troubleshoot this error.

I tried draining or deleting that node and letting the cluster recreate it, but it is still not working.

Help me, please.

Update: According to the GCP documentation, a private cluster will fail to pull images from Docker Hub.

BUT the weirdest thing is that ONLY ONE node is unable to pull the images.

Upvotes: 6

Views: 2822

Answers (2)

neoakris

Reputation: 5135

I recall running into this before and finding an answer.

https://cloud.google.com/container-registry/docs/pulling-cached-images
talks about it a little, but I'll explain it so it's easy to follow.

If I spin up a private GKE cluster and I create 3 deployments:

  • 1st uses image: nginx:latest
  • 2nd uses image: nginx:stable
  • 3rd uses image: docker.io/busybox:1.36.0-glibc

nginx:latest (common tag) will almost always work.
nginx:stable (popular tag) will work sometimes.
The super-specific, rarely used tag will almost always fail with ImagePullBackOff.

So why is this the case?
1. The ImagePullBackOff happens when the pods/nodes have no NAT gateway and therefore no internet access.
kubectl exec -it working-nginx-latest-pod -- curl yahoo.com
^-- You can prove there is no internet access with this. Note that curl google.com is a bad test on GKE: you'll still get a response, because Google's network can reach google.com without going through the internet. That's why I recommend testing with a non-Google URL like yahoo.com.
(Google's networking also occasionally does some counterintuitive, non-standard things, like routing public IP addresses over its internal network, so sometimes you can reach public IP addresses without internet access. It's usually Google services with public IPs that are reachable this way.)
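
If curl isn't available in the container, the same kind of check can be sketched in Python. This is only a sketch, assuming the pod's image ships a Python 3 interpreter; the function name can_connect is my own and not something GKE provides. It attempts a plain TCP connection with a timeout, which is roughly the step that timed out in the question's error message.

```python
import socket


def can_connect(host: str, port: int = 443, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # OSError covers DNS failures, refused connections, and timeouts.
        return False
```

Calling can_connect("registry-1.docker.io") and can_connect("yahoo.com") from a pod on the failing node, and again from a pod on a working node, should show whether the nodes really differ in outbound connectivity.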

2. So the next question is: how are nginx:latest and nginx:stable able to pull an image that exists on the internet (on Docker Hub) when there's no internet access? Basically, why does it work for some images and not others?
The answer boils down to the popularity of the image:tag pair. Is it popular enough to get cached in mirror.gcr.io?

The link I shared at the top mentions "Container Registry caches frequently-accessed public Docker Hub images on mirror.gcr.io". So if you reference a common tag of a popular image, you can sometimes get lucky and pull it even without internet access, because the cache is reachable via private IP space.

When a pod running on a GKE private cluster gives you ImagePullBackOff and you're thinking, "What's going on? I know this image exists! docker pull docker.io/busybox:1.36.0-glibc pulls fine from my local machine", what's happening is that the rarely used tag doesn't exist in their cache, which only mirrors common tags of popular images.

The best way to fix it is either to pull all images from pkg.dev (GCP's Artifact Registry, which GKE should be able to access without internet access) or to set up a NAT gateway so the private cluster has internet access. You can use kubectl exec -it working-nginx-latest-pod -- curl yahoo.com as a feedback loop to check whether the cluster has internet access as you tinker with VPC settings to add the NAT gateway.
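
If you go the pkg.dev route, it helps to know which of your image references still resolve to Docker Hub. Here is a rough Python sketch, assuming Docker's usual reference rule (the first path component counts as a registry host only if it contains a "." or ":" or is exactly "localhost"; otherwise the image comes from docker.io). The function names and the pkg.dev path in the example are hypothetical, just for illustration.

```python
DOCKER_HUB_HOSTS = {"docker.io", "index.docker.io", "registry-1.docker.io"}


def registry_host(image: str) -> str:
    """Return the registry host that an image reference resolves to."""
    # A digest suffix (@sha256:...) does not affect the host, so drop it.
    name = image.split("@", 1)[0]
    first, sep, _rest = name.partition("/")
    # Without a "/" there is no registry component at all (e.g. "nginx:latest").
    if sep and ("." in first or ":" in first or first == "localhost"):
        return first
    return "docker.io"


def depends_on_docker_hub(image: str) -> bool:
    """True if pulling this image needs Docker Hub (or a mirror cache hit)."""
    return registry_host(image) in DOCKER_HUB_HOSTS


images = [
    "bitnami/mongodb:3.6",
    "nginx:latest",
    "docker.io/busybox:1.36.0-glibc",
    "us-docker.pkg.dev/my-project/my-repo/mongodb:3.6",  # hypothetical path
]
for image in images:
    print(image, "->", registry_host(image))
```

Anything that reports docker.io is an image that will only pull on a private cluster without NAT if it happens to be in the mirror cache.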

https://cloud.google.com/kubernetes-engine/docs/best-practices/networking#use-cloudnat
mentions that, by default, (GKE) "private clusters don't have internet access. In order to allow Pods to reach the internet, enable Cloud NAT for each region. At a minimum, enable Cloud NAT for the primary and secondary ranges in the GKE subnet."

Upvotes: 3

Meir Tseitlin

Reputation: 2068

There was a related bug reported in Kubernetes 1.11.

Make sure this is not your case.

Upvotes: 1
