Reputation: 298
I have run into a bit of trouble with what seems to be an easy question.
My scenario: I have a k8s Job that can be run at any time (not a CronJob) and that in turn creates a pod to perform some tasks. Once the pod finishes its task it completes, thus completing the job that spawned it.
What I want:
I want to alert via Prometheus if the pod is in a running state for more than 1h, signalling that the task is taking too much time.
I'm interested in alerting ONLY when the duration symbolised by the arrow in the attached image (i.e. how long the pod has been running) exceeds 1h, and in having no alerts triggered when the pod is no longer running.
What I tried: the following Prometheus metric, an instant vector that is either 0 (pod not running) or 1 (pod is running):
kube_pod_status_ready{condition="true",pod_name=~".+POD-A.+"}
I figured I could use this metric with the following formula to compute the duration for which the metric was 1 during a day:
(1 - avg_over_time(kube_pod_status_ready{condition="true",pod_name=~".+POD-A.+"}[1d])) * 86400 > 3600
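For reference, here is a quick sketch of the arithmetic behind that expression (assuming the series is present and scraped for the entire 1d window, which is exactly what does not hold for these short-lived pods; same metric and selector as above):
# avg_over_time of a 0/1 series is the fraction of the *sampled* window during
# which the value was 1, so avg * 86400 approximates the seconds spent at 1 per
# day, and (1 - avg) * 86400 the seconds spent at 0. For a pod that only exists
# for part of the day the average covers far fewer samples, skewing the result.
avg_over_time(kube_pod_status_ready{condition="true",pod_name=~".+POD-A.+"}[1d]) * 86400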
Because these pods come and go and are not always present, I'm encountering the following problems:
Upvotes: 5
Views: 3855
Reputation: 298
Thanks to the suggestion from @HelloWorld, I think this is the best solution to achieve what I wanted:
(sum_over_time(kube_pod_status_ready{condition="true",pod_name=~".+POD-A.+"}[1d:1s]) > 3600) and (kube_pod_status_ready{condition="true",pod_name=~".+POD-A.+"}==1)
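For anyone else reading, here is my understanding of how the two halves of that expression behave (the comments are just my reading of it, not something from the original suggestion):
# The [1d:1s] subquery re-evaluates the 0/1 ready metric every second over the
# last day, so sum_over_time roughly counts the seconds the pod reported
# Ready=true in that window; > 3600 means more than one hour in total.
sum_over_time(kube_pod_status_ready{condition="true",pod_name=~".+POD-A.+"}[1d:1s]) > 3600
# The "and ... == 1" half keeps only series that are Ready right now, so the
# alert can only fire while the pod is still running and clears once it stops.
and kube_pod_status_ready{condition="true",pod_name=~".+POD-A.+"} == 1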
Upvotes: 6