Reputation: 587
I have a service that pushes forecast metrics to Prometheus every minute through a push gateway, and I have configured an alert rule in Prometheus.
Requirements:
The reason we need a stability filter: sometimes the service cannot push metrics because the push gateway is down for about a minute and recovers within two minutes. We do not want to send firing alerts in that scenario.
Prometheus configurations:
evaluation_interval: 1m
scrape_interval: 30s
Alert rule:
- alert: forecaster
  expr: rate(forecasts_published_counter{job="metrics_job", module_name="forecaster"}[5m]) <= 0
  for: 5m
I experimented with a stability filter using the for clause: it works for firing alerts, but it does not work for resolving alerts.
When the service has not published forecasts for over 5 minutes:
When the service has been publishing forecasts over the last 5 minutes:
I could change the evaluation interval to 5m, but that affects other services, so I do not want to change it.
Is there any other way to set a stability filter (5m) in Prometheus for changing the alert state from firing to inactive (resolved)?
Upvotes: 1
Views: 852
Reputation: 13432
Try
- alert: forecaster
  expr: time() - timestamp(forecasts_published_counter{job="metrics_job", module_name="forecaster"}) > 5 * 60
  annotations:
    description: "Forecast has not been updated for {{ $value }} seconds"
The logic from the previous answer about scrape and evaluation time applies here too.
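Applying that adjustment here would mean shrinking the 5-minute (300-second) threshold by half of the evaluation interval and half of the scrape interval. A minimal sketch, assuming the 1m evaluation interval and 30s scrape interval from the question (the 255-second threshold is just that arithmetic, not a value given in the answer):

- alert: forecaster
  # 300s - evaluation_interval/2 (30s) - scrape_interval/2 (15s) = 255s
  expr: time() - timestamp(forecasts_published_counter{job="metrics_job", module_name="forecaster"}) > 255
  annotations:
    description: "Forecast has not been updated for {{ $value }} seconds"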
Upvotes: 0
Reputation: 13432
You could use the absent_over_time function to check whether Prometheus has received the specified metric in the last 5 minutes.
- alert: forecaster
  expr: absent_over_time(forecasts_published_counter{job="metrics_job", module_name="forecaster"}[5m])
If your metric is missing for 5 minutes, absent_over_time returns a pseudo-metric with your labels and the alert fires. Once your metric appears again, absent_over_time returns nothing and the alert is resolved.
Since in this case there is no for: clause in the alert rule, there will be no pending state. However, the alert will not fire exactly 5 minutes after the metric goes missing, but rather after 5 minutes plus some extra time (depending on evaluation_interval).
If you want the firing time to be closer to 5 minutes, you could set the range for absent_over_time to something like 5m - evaluation_interval/2 - scrape_interval/2 (calculated manually from your config); I believe in your case that is 4m15s. A sketch of the adjusted rule follows below.
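A minimal sketch of the adjusted rule, assuming the 1m evaluation interval and 30s scrape interval from the question (the 4m15s window simply plugs those values into the formula above):

- alert: forecaster
  # 5m - evaluation_interval/2 (30s) - scrape_interval/2 (15s) = 4m15s
  expr: absent_over_time(forecasts_published_counter{job="metrics_job", module_name="forecaster"}[4m15s])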
Upvotes: 0