Joseph Gagnon

Reputation: 2135

A Kubernetes worker node is being "ignored" for metrics scraping

I am using Prometheus and Grafana to collect and display metrics for a Kubernetes cluster. While collecting memory information, I discovered that one of the worker nodes appears in the results for some metrics but not for others. The only thing I can see that might be related is that this node has a taint applied.

Here is the node taint:

nodeType=runner-node:NoExecute

The rest of the worker nodes have no (obvious) taint. Could this be the reason why nothing is being scraped?

Here is an example of a metric that has information for this node (arc-worker-4):

Query:

machine_memory_bytes{node="arc-worker-4"}

Result:

metric value
machine_memory_bytes{boot_id="3b6af3e8-d3ae-457a-92be-f7da2adededf", endpoint="https-metrics", instance="172.20.32.14:10250", job="kubelet", machine_id="6c59590e61484bfca6f8da38897d7760", metrics_path="/metrics/cadvisor", namespace="kube-system", node="arc-worker-4", service="prometheus-kube-prometheus-kubelet", system_uuid="c7874d56-2d9d-ce1a-986f-1f549f1784b6"} 135090417664

If I run a query on another metric, I get no result:

Query:

node_memory_MemTotal_bytes{node="arc-worker-4"}

Result:

Empty query result

In the group of metrics named node_memory_..._bytes (of which there are about 50), none of these have any data for this node. Why? I get data for all other nodes, including the master node.

Upvotes: 0

Views: 482

Answers (1)

Joseph Gagnon

Reputation: 2135

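I was able to resolve this problem by adding a toleration to the Prometheus (kube-prometheus-stack) config. The node_memory_..._bytes metrics come from node-exporter, whereas machine_memory_bytes comes from the kubelet (cAdvisor), which is why only the former were missing. The toleration allows the node-exporter that came with Prometheus to be deployed onto the node with that taint. I am now getting results from the node_memory_..._bytes family of metrics.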

What was done:

In the Prometheus Helm chart values.yaml, the following was added:

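  prometheus-node-exporter:
    tolerations:
      # Chart default: tolerate any NoSchedule taint (e.g. on the master node).
      - effect: NoSchedule
        operator: Exists
      # Added: tolerate the nodeType=runner-node:NoExecute taint on arc-worker-4.
      - key: nodeType
        operator: Equal
        value: runner-node
        effect: NoExecute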

The first toleration is the chart's default, but it must be repeated here because setting tolerations in values.yaml replaces the default list rather than adding to it. I needed it so that the master node would still be scraped.
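
A possible alternative, which is not what was deployed here: if it is acceptable for node-exporter to run on every node regardless of its taints, a single toleration with operator Exists and no key or effect matches all taints. A minimal sketch under that assumption:

  prometheus-node-exporter:
    tolerations:
      # Assumption: node-exporter should run on every node, whatever its taints.
      # An empty key with operator Exists tolerates all keys, values, and effects.
      - operator: Exists

This avoids listing each taint individually, at the cost of also tolerating taints added later that may be meant to keep all pods off a node.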

Upvotes: 0
