dustinmoris

Reputation: 3361

New GKE cluster logging thousands of errors

After creating a fresh Kubernetes cluster in Google Kubernetes Engine (GKE) I am seeing a lot of errors in Google Cloud Logging related to the metrics agent.

I had this problem with an existing cluster on version 1.18.x. I then upgraded to 1.19.x after a suggestion that this would fix it. However, the problem persisted, so I upgraded to 1.20.x and still saw no change.

Eventually I created a brand-new cluster on the most recent Kubernetes version and still see hundreds of errors being logged immediately after running:

gcloud beta container clusters create "my-cluster-1" \
    --project "my-project-1" \
    --zone "europe-west2-a" \
    --no-enable-basic-auth \
    --release-channel "rapid" \
    --cluster-version "1.20.2-gke.2500" \
    --machine-type "e2-standard-2" \
    --image-type "COS_CONTAINERD" \
    --disk-type "pd-standard" \
    --disk-size "100" \
    --metadata disable-legacy-endpoints=true \
    --scopes "https://www.googleapis.com/auth/devstorage.read_only","https://www.googleapis.com/auth/logging.write","https://www.googleapis.com/auth/monitoring","https://www.googleapis.com/auth/servicecontrol","https://www.googleapis.com/auth/service.management.readonly","https://www.googleapis.com/auth/trace.append" \
    --num-nodes "1" \
    --enable-stackdriver-kubernetes \
    --enable-private-nodes \
    --master-ipv4-cidr "172.16.0.0/28" \
    --enable-ip-alias \
    --network "projects/my-project-1/global/networks/default" \
    --subnetwork "projects/my-project-1/regions/europe-west2/subnetworks/default" \
    --default-max-pods-per-node "110" \
    --no-enable-master-authorized-networks \
    --addons HorizontalPodAutoscaling,HttpLoadBalancing,NodeLocalDNS,GcePersistentDiskCsiDriver \
    --enable-autoupgrade \
    --enable-autorepair \
    --max-surge-upgrade 1 \
    --max-unavailable-upgrade 0 \
    --workload-pool "my-project-1.svc.id.goog" \
    --enable-shielded-nodes \
    --node-locations "europe-west2-a","europe-west2-b","europe-west2-c"

In Google Cloud Logging I check for errors using this query:

severity=ERROR
AND (resource.labels.container_name:"gke-metrics-agent"
OR resource.labels.container_name="metrics-server-nanny")
resource.labels.cluster_name="my-cluster-1"
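For reference, the same filter can also be run from the command line. This is a sketch that assumes the gcloud CLI is installed and authenticated against `my-project-1`; the filter string is the same one used in the Cloud Logging UI:

```shell
# Read the most recent ERROR entries emitted by the metrics agent
# containers in the affected cluster.
gcloud logging read \
  'severity=ERROR
   AND (resource.labels.container_name:"gke-metrics-agent"
   OR resource.labels.container_name="metrics-server-nanny")
   AND resource.labels.cluster_name="my-cluster-1"' \
  --project my-project-1 \
  --limit 20 \
  --format json
```

This makes it easy to poll the error volume repeatedly while waiting to see whether it subsides.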

[screenshot: error entries for gke-metrics-agent in Cloud Logging]

As per another suggestion I waited for more than 10 minutes, but the same volume of errors kept being logged:

[screenshot: error count unchanged after waiting]


UPDATE 05 March 2021

I created a new test cluster via the UI, changing nothing except the cluster name (set to test-cluster-1), the zone (europe-west2-a), and the Kubernetes version (the latest in the rapid channel, as suggested):

[screenshot: test cluster settings in the GKE UI]

Immediately after creating the new cluster, hundreds of errors are logged:

[screenshot: error logs for the new test cluster]

I'll observe for 15-20 minutes to see whether it stays that way.

Upvotes: 5

Views: 1243

Answers (1)

PjoterS

Reputation: 14084

As mentioned in a previous thread, GKE cluster version 1.18.12-gke.1206 contained a bug that logged hundreds of Prometheus errors:

github.com/prometheus/prometheus/scrape.(*scrapeLoop).scrapeAndReport

This issue has already been reported via the Issue Tracker and has been fixed in versions 1.18.14-gke.1200+ and 1.19.6-gke.600+. New clusters on those versions or newer contain the fix.

The OP's cluster configuration contained a flag that caused this issue to reoccur. I tested a few scenarios, but OP @dustinmoris found that it was caused by the NodeLocalDNS addon.

Enabling the NodeLocalDNS addon alone reproduces the issue. This was tested on versions 1.20.2-gke.2500, 1.19.7-gke.1500, 1.19.7-gke.2503, and 1.18.15-gke.1102.
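As a possible workaround until the fix lands, the addon can be turned off on an existing cluster. This is a sketch, not a confirmed fix from the Issue Tracker; it assumes the cluster and zone from the question, and note that changing the NodeLocalDNS addon restarts the node pools and temporarily disrupts in-cluster DNS caching:

```shell
# Disable the NodeLocalDNS addon that appears to trigger the
# metrics-agent errors (beta command track at the time of writing).
gcloud beta container clusters update my-cluster-1 \
  --zone europe-west2-a \
  --update-addons NodeLocalDNS=DISABLED
```

After the node pools have cycled, re-run the Cloud Logging query from the question to confirm the error volume drops.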

Proper comments have already been added to the Issue Tracker; please check it for all further updates.

Upvotes: 2
