localghost

Reputation: 409

Error pulling docker image from GCR into GKE "Failed to pull image .... 403 Forbidden"

Background:

I have a GKE cluster which has suddenly stopped being able to pull my Docker images from GCR; both are in the same GCP project. It had been working well for several months with no issues pulling images, and has now started throwing errors without any changes having been made.

(NB: I'm generally the only one on my team who accesses Google Cloud, though it's entirely possible that someone else on my team made changes, perhaps inadvertently, without realising.)

I've seen a few other posts on this topic, but the solutions offered there haven't helped. Two of these posts stood out to me in particular, as they were both posted around the same day my issues started, ~13/14 days ago. Whether or not that's a coincidence, who knows...

This post describes the same issue as mine; I'm unsure whether the comments there helped that poster resolve it, but they haven't fixed it for me. This post also seemed to be the same issue, but the poster says it resolved itself after waiting some time.

The Issue:

I first noticed the issue on the cluster a few days ago. I went to deploy a new image by pushing the image to GCR and then bouncing the pods with kubectl rollout restart deployment.
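For context, the deploy flow was roughly the following (image and deployment names are placeholders):

# Push the new image to GCR (names are placeholders)
docker push gcr.io/<GCP_PROJECT>/XXX:dev-latest

# Bounce the pods so they pick up the new image
kubectl rollout restart deployment <DEPLOYMENT_NAME>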

The pods all then came back with ImagePullBackOff, saying that they couldn't get the image from GCR:

kubectl get pods:

XXX-XXX-XXX     0/1     ImagePullBackOff   0          13d
XXX-XXX-XXX     0/1     ImagePullBackOff   0          13d
XXX-XXX-XXX     0/1     ImagePullBackOff   0          13d
...

kubectl describe pod XXX-XXX-XXX:

Normal   BackOff           20s                kubelet                                Back-off pulling image "gcr.io/<GCP_PROJECT>/XXX:dev-latest"
Warning  Failed            20s                kubelet                                Error: ImagePullBackOff
Normal   Pulling           8s (x2 over 21s)   kubelet                                Pulling image "gcr.io/<GCP_PROJECT>/XXX:dev-latest"
Warning  Failed            7s (x2 over 20s)   kubelet                                Failed to pull image "gcr.io/<GCP_PROJECT>/XXX:dev-latest": rpc error: code = Unknown desc = failed to pull and unpack image "gcr.io/<GCP_PROJECT>/XXX:dev-latest": failed to resolve reference "gcr.io/<GCR_PROJECT>/XXX:dev-latest": unexpected status code [manifests dev-latest]: 403 Forbidden
Warning  Failed            7s (x2 over 20s)   kubelet                                Error: ErrImagePull

I know that the image definitely exists in GCR -

I've SSH'd into one of the cluster nodes and tried to docker pull manually, with no success:

docker pull gcr.io/<GCP_PROJECT>/XXX:dev-latest
Error response from daemon: unauthorized: You don't have the needed permissions to perform this operation, and you may have invalid credentials. To authenticate your request, follow the steps in: https://cloud.google.com/container-registry/docs/advanced-authentication

(Also did a docker pull of a public mongodb image to confirm that was working, and it's specific to GCR).
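While SSH'd into the node, the identity and OAuth scopes it is running with can also be checked against the metadata server, e.g.:

# Which service account the node runs as
curl -s -H "Metadata-Flavor: Google" \
  "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/email"

# Which OAuth scopes that account has on this node
curl -s -H "Metadata-Flavor: Google" \
  "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/scopes"

If the storage read scope (https://www.googleapis.com/auth/devstorage.read_only, or cloud-platform) is missing, pulls from GCR will fail regardless of IAM roles.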

So this leads me to believe it's an issue with the service account not having the correct permissions, as described in the cloud docs under the 'Error 400/403' section. That section seems to suggest that the service account has either been deleted or edited manually.

During my troubleshooting, I tried to find out exactly which service account GKE was using to pull from GCR. The steps outlined in the docs say: 'The name of your Google Kubernetes Engine service account is as follows, where PROJECT_NUMBER is your project number:'

service-PROJECT_NUMBER@container-engine-robot.iam.gserviceaccount.com
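(If needed, the project number can be looked up with, e.g.:)

gcloud projects describe <GCP_PROJECT> --format='value(projectNumber)'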

I found the service account and checked the policies - it did have one for roles/container.serviceAgent, but nothing specifically mentioning Kubernetes as I would expect from the description in the docs: 'the Kubernetes Engine Service Agent role' (unless that is the one they're describing, in which case I'm no better off than before anyway...).

Assuming it must not have had the correct roles, I then followed the steps to re-enable it (disable and then re-enable the Kubernetes Engine API). Running gcloud projects get-iam-policy <GCP_PROJECT> again and diffing the two outputs (before/after), the only difference is that a service account for '@cloud-filer...' has been deleted.
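For the before/after diff, roughly (file names are just placeholders):

gcloud projects get-iam-policy <GCP_PROJECT> > iam-before.yaml
# ...disable and re-enable the Kubernetes Engine API here...
gcloud projects get-iam-policy <GCP_PROJECT> > iam-after.yaml
diff iam-before.yaml iam-after.yaml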

Thinking maybe the error was something else, I thought I would try spinning up a new cluster. Same error - can't pull images.

Send help..

I've been racking my brains to try to troubleshoot, but I'm now out of ideas! Any and all help much appreciated!

Upvotes: 14

Views: 24510

Answers (6)

Tho Quach

Reputation: 1425

I had the same problem. There are two cases:

  1. If you specified a service account in the node config when using Terraform to define the node pool in your GKE cluster (docs), note that service account's name.
  2. If you didn't specify anything, Terraform uses the default service account (with minimum permissions) to create the node pool for you. You won't see the actual name of this service account in GKE, because it only shows 'Service account: default' -> go to Compute Engine -> VM instances -> click the name of an instance that belongs to your node pool and look for 'Service account' to get the name you want.

Final step: go to IAM and grant access to the service account from above. At a minimum you need to grant Storage Object Viewer to the service account so it can pull images from the registry (docs), for example as sketched below.
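A rough sketch of both steps with gcloud (instance, zone, project and service-account names are placeholders):

# Find the service account the node pool's VMs run as
gcloud compute instances describe <NODE_INSTANCE_NAME> --zone <ZONE> \
  --format='value(serviceAccounts[].email)'

# Grant it Storage Object Viewer so it can pull from GCR
gcloud projects add-iam-policy-binding <GCP_PROJECT> \
  --member="serviceAccount:<NODE_SA_EMAIL>" \
  --role="roles/storage.objectViewer"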

Then go back to your GKE cluster and delete your pods to re-trigger the image pull from the registry. That worked for me.

Upvotes: 1

jwwebsensa

Reputation: 59

I don't know if it still helps, but I had the same issue and managed to fix it.

In my case I was deploying GKE through Terraform and did not specify the oauth_scopes property for the node pool, as shown in the example. As I understand it, you need to make the GCP APIs available here so that the nodes are able to use them.
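As a hedged check (cluster, node pool and zone names are placeholders), the scopes a node pool ended up with can be inspected with gcloud; pulls from GCR need at least the https://www.googleapis.com/auth/devstorage.read_only scope (or cloud-platform):

gcloud container node-pools describe <NODE_POOL> \
  --cluster <CLUSTER_NAME> --zone <ZONE> \
  --format='value(config.oauthScopes)'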

Upvotes: 5

Karandashov Daniil

Reputation: 1

In my case, what worked was re-adding (i.e. deleting and then re-adding) the "Artifact Registry Reader" role for the service account used by the cluster, along the lines of the sketch below.
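Roughly, with gcloud (project ID and service-account email are placeholders):

# Remove and then re-add the Artifact Registry Reader binding
gcloud projects remove-iam-policy-binding <GCP_PROJECT> \
  --member="serviceAccount:<NODE_SA_EMAIL>" \
  --role="roles/artifactregistry.reader"
gcloud projects add-iam-policy-binding <GCP_PROJECT> \
  --member="serviceAccount:<NODE_SA_EMAIL>" \
  --role="roles/artifactregistry.reader"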

Upvotes: 0

Jonas Bergström

Reputation: 767

I believe the correct solution is to add the "roles/artifactregistry.reader" role to the service account that the node pool is configured to use. In Terraform that can be done with:

resource "google_project_iam_member" "allow_image_pull" {
  project = var.project_id
  role   = "roles/artifactregistry.reader"
  member = "serviceAccount:${var.service_account_email}"
}

Upvotes: 4

localghost

Reputation: 409

Have now solved this.

The service account had the correct roles/permissions, but for whatever reason stopped working.

I manually created a key for that service account, added that key to the kube cluster as a secret, and set the service account to use that key.

Still at a loss as to why it wasn't already doing this, or why it stopped working in the first place all of a sudden, but it's working...

The fix was from this guide, from the section starting 'Create & use GCR credentials'.
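For anyone following along, those steps look roughly like this (key file, secret name, email and service-account names are placeholders, and the Kubernetes service account your pods use may differ):

# Create a JSON key for the service account
gcloud iam service-accounts keys create gcr-key.json \
  --iam-account=<SERVICE_ACCOUNT_EMAIL>

# Add it to the cluster as an image pull secret
kubectl create secret docker-registry gcr-json-key \
  --docker-server=gcr.io \
  --docker-username=_json_key \
  --docker-password="$(cat gcr-key.json)" \
  --docker-email=<ANY_EMAIL>

# Point the pods' Kubernetes service account at that secret
kubectl patch serviceaccount default \
  -p '{"imagePullSecrets": [{"name": "gcr-json-key"}]}'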

Upvotes: 2

Arghya Sadhu

Reputation: 44657

From the docs, it's the Compute Engine default service account that accesses Container Registry to pull images, not the Kubernetes Engine service account. You can go to the node pool and check the service account name in the security section. Check the access logs of that service account to see the errors, and then grant it the necessary permissions.
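For example, the service account the node pools run as can be read with (cluster name and zone are placeholders):

gcloud container clusters describe <CLUSTER_NAME> --zone <ZONE> \
  --format='value(nodePools[].config.serviceAccount)'

If it prints 'default', the nodes are using the Compute Engine default service account.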

Upvotes: 1
