We have a counter for some operation, say some_counter. The counter is incremented each time the operation is performed for a customer (customer_id is the label), and the operation is usually performed once per day.
I want to create a Grafana alert for one important customer (customer_id 1) that fires if the operation is not performed within a day.
I used:
max by(customer_id) (idelta(some_counter{customer_id="1"}[1h]))
as the metric and reduced it with max over the last 25h. If this number is > 0, the operation was performed within the last 25h.
The problem is that after a machine restart/deployment the counter was reset and the series for customer_id 1 became undefined. When the operation next happened, the counter was set to 1 for customer_id 1, but the idelta function returned 0, so my alert started to fire.
I understand that idelta is meant to be used only with gauges, so I also tried increase, irate and rate, but they show 0 as well. I'd like help understanding what the right metric and alert are for this kind of scenario.
For reference, this is the counter value using sum by(customer_id) (some_counter)
I also looked into the absent function and possibly joining the two time series, but that seems excessive for such a simple problem. There are a few similar Stack Overflow questions, but their answers do not work for dynamic labels like customer_id.
To quote the Prometheus documentation: "idelta should only be used with gauges."
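A minimal Python model makes the point concrete. It is a simplification, not Prometheus code: it only looks at the sample values inside one range window and ignores timestamps and staleness. Because idelta just subtracts the last two samples, a counter reset shows up as a large negative value rather than as an increase:

```python
from typing import Optional, Sequence


def idelta(samples: Sequence[float]) -> Optional[float]:
    """Simplified model of PromQL idelta(): the difference between the
    last two samples in the range window, with no counter-reset handling.
    Returns None ("no result") if the window holds fewer than two samples."""
    if len(samples) < 2:
        return None
    return samples[-1] - samples[-2]


# Counter went 5 -> 6, then the process restarted and the counter came back at 1.
print(idelta([5, 6, 1]))     # -5: the reset is read as a drop, not an increase
# The series only exists after the restart and has stayed at 1 in this window.
print(idelta([1, 1, 1, 1]))  # 0
# A single sample is not enough for idelta to return anything.
print(idelta([1]))           # None
```

The second case matches the question: once the re-created series has sat at 1 for a while, idelta reports 0.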
To find the increase of your counter over the last 25 hours, with correction for counter resets, use the increase function:
increase(some_counter{customer_id="1"}[25h])
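Roughly how the reset correction works, as a simplified Python sketch. It is an illustration under an assumption: it skips the extrapolation to the window edges that Prometheus applies, so the numbers Prometheus actually returns will differ somewhat from these:

```python
from typing import Optional, Sequence


def increase(samples: Sequence[float]) -> Optional[float]:
    """Simplified model of PromQL increase() over one range window.

    Applies the counter-reset rule (a drop in value means the counter restarted
    from 0, so the new value itself counts as increase) but leaves out the
    extrapolation to the window boundaries that Prometheus performs.
    Returns None ("no result") when the window has fewer than two samples.
    """
    if len(samples) < 2:
        return None
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        if cur < prev:
            total += cur          # reset: the counter restarted from 0
        else:
            total += cur - prev   # normal monotonic growth
    return total


print(increase([5, 6, 1, 2]))  # 3.0: 5->6 (+1), reset to 1 (+1), 1->2 (+1)
print(increase([1, 1, 1]))     # 0.0
print(increase([1]))           # None
```

Note that in this model a window containing only [1, 1, 1], i.e. a series that first appeared at 1 and never changed, still yields 0: the jump from "no series" to 1 is not counted as an increase.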
For an alert rule that fires if the counter did not increase in the last 25 hours, allowing for the possibility that the counter was reset and has not been initialized yet, you can use this expression:
increase(some_counter{customer_id="1"}[25h]) == 0
or on(customer_id) absent(some_counter{customer_id="1"})
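The expression combines two branches with or, and the alert fires if either branch returns a result. Below is a Python sketch of that logic for a single series, under the same simplifying assumptions as a plain model (no extrapolation). The parameters window and present_now are hypothetical inputs standing in for the samples the 25h range selector sees and for whether the instant selector inside absent() currently finds the series:

```python
from typing import Optional, Sequence


def increase(samples: Sequence[float]) -> Optional[float]:
    """Simplified increase(): reset-corrected, no extrapolation.
    None means "no result" (fewer than two samples in the window)."""
    if len(samples) < 2:
        return None
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        # On a reset the counter restarted from 0, so count the new value.
        total += cur if cur < prev else cur - prev
    return total


def alert_fires(window: Sequence[float], present_now: bool) -> bool:
    """Models, for one series:
        increase(some_counter{customer_id="1"}[25h]) == 0
          or on(customer_id) absent(some_counter{customer_id="1"})
    """
    inc = increase(window)
    # Left side: only yields a result when increase() has one and it equals 0.
    no_increase = inc is not None and inc == 0
    # Right side: absent() yields a result only when the series is missing.
    missing = not present_now
    # `or` is a set union, so the alert fires if either side returns a result.
    return no_increase or missing


print(alert_fires([5, 6, 1, 2], present_now=True))  # False
print(alert_fires([1, 1, 1], present_now=True))     # True
print(alert_fires([], present_now=False))           # True
```

Because increase() returns no result when the window holds fewer than two samples, neither branch fires for a series that exists but has only one sample in the window.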