Paul Chibulcuteanu

Prometheus alerting when a pod is running for too long

I have run into a bit of trouble with what seems to be an easy question.

My scenario: I have a k8s Job which can be run at any time (not a CronJob) and which in turn creates a pod to perform some tasks. Once the pod finishes its task it completes, thus completing the Job that spawned it.
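For illustration, the Job looks roughly like this (the Job name, container name, image and command below are placeholders, not the real ones):

apiVersion: batch/v1
kind: Job
metadata:
  name: pod-a-task                            # placeholder Job name
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker                        # placeholder container name
          image: my-registry/pod-a-task:latest  # placeholder image
          command: ["sh", "-c", "run-the-task"] # placeholder task command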

What I want: I want to alert via Prometheus if the pod is in a Running state for more than 1h, signalling that the task is taking too much time. I'm interested in alerting ONLY when the pod's running duration exceeds 1h, and in having no alerts triggered once the pod is no longer running.

What I tried: The following Prometheus metric, which is an instant vector that is either 0 (pod not running) or 1 (pod running):

kube_pod_status_ready{condition="true",pod_name=~".+POD-A.+"}

I figured I would try to use this metric in the following formula to compute the duration for which the metric was 1 during a day:

(1 - avg_over_time(kube_pod_status_ready{condition="true",pod_name=~".+POD-A.+"}[1d])) * 86400 > 3600

Because these pods come and go and are not always present, I'm running into problems with this approach.

Answers (1)

Paul Chibulcuteanu

Thanks to the suggestion of @HelloWorld, I think this is the best solution to achieve what I wanted:

(sum_over_time(kube_pod_status_ready{condition="true",pod_name=~".+POD-A.+"}[1d:1s]) > 3600) and (kube_pod_status_ready{condition="true",pod_name=~".+POD-A.+"}==1)
  • Count the number of seconds the pod was running over the past day/6h/3h and check whether that exceeds 1h (3600s), AND
  • Check that the pod is still running, so that old pods or pods that have already terminated are not taken into account (see the alerting-rule sketch below).
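For reference, here is a rough sketch of how this expression could be wired into a Prometheus alerting rule. The group name, alert name, severity label and annotation text are placeholders, not taken from my actual setup:

groups:
  - name: pod-runtime                 # placeholder group name
    rules:
      - alert: PodARunningTooLong     # placeholder alert name
        expr: |
          (
            sum_over_time(kube_pod_status_ready{condition="true",pod_name=~".+POD-A.+"}[1d:1s]) > 3600
          )
          and
          (
            kube_pod_status_ready{condition="true",pod_name=~".+POD-A.+"} == 1
          )
        labels:
          severity: warning           # placeholder severity
        annotations:
          summary: "Pod {{ $labels.pod_name }} has been running for more than 1h"

Note that the [1d:1s] subquery evaluates the inner expression at 1-second resolution over a full day, which can be expensive; a coarser step such as [1d:30s], with the threshold scaled down accordingly (3600/30 = 120), would be cheaper.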
