Shivakumar Sajjan

Reputation: 587

Is there a way to set a stability filter for alerts in Prometheus?

I have a service that pushes forecast metrics to Prometheus every minute through a push gateway, and I have configured an alert rule in Prometheus.

Requirements:

We need a stability filter because the service is sometimes unable to push metrics when the push gateway goes down for about a minute; since the push gateway recovers within 2 minutes, we do not want to send a firing alert in this scenario.

Prometheus configurations:

evaluation_interval: 1m

scrape_interval: 30s

Alert rule:

- alert: forecaster
  expr: rate(forecasts_published_counter{job="metrics_job", module_name="forecaster"}[5m]) <= 0
  for: 5m
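For completeness, this rule lives in a rules file loaded by Prometheus; a minimal sketch of the surrounding rule group (the group name here is just a placeholder) looks like:

groups:
  - name: forecaster-alerts   # placeholder group name
    rules:
      - alert: forecaster
        expr: rate(forecasts_published_counter{job="metrics_job", module_name="forecaster"}[5m]) <= 0
        for: 5m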

I experimented with a stability filter using the for: clause; it works for firing alerts, but it does not work for resolving alerts.

When the service does not publish forecasts for more than 5 minutes:

When the service is publishing forecasts within the 5-minute window:

I could change the evaluation interval to 5m, but that would affect other services, so I do not want to change it.

Is there any other way to set a stability filter (5m) in Prometheus for changing the alert state from firing to inactive (resolved)?

Upvotes: 1

Views: 852

Answers (2)

markalex

Reputation: 13432

Try

- alert: forecaster
  expr: time() - timestamp(forecasts_published_counter{job="metrics_job", module_name="forecaster"}) > 5 * 60
  annotations:
    description: "Forecast has not been updated for {{ $value }} seconds"

The logic from my other answer about scrape and evaluation timing applies here too.
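If you want firing to happen closer to 5 minutes with this expression as well, a sketch of the same rule with the threshold tightened by half an evaluation interval plus half a scrape interval (an assumption carried over from my other answer; 300 - 30 - 15 = 255 seconds, i.e. 4m15s with your configuration):

- alert: forecaster
  # 5*60 - evaluation_interval/2 - scrape_interval/2 = 300 - 30 - 15 = 255
  expr: time() - timestamp(forecasts_published_counter{job="metrics_job", module_name="forecaster"}) > 255
  annotations:
    description: "Forecast has not been updated for {{ $value }} seconds"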

Upvotes: 0

markalex

Reputation: 13432

You could use the absent_over_time function to check whether Prometheus has received the specified metric within the last 5 minutes.

- alert: forecaster
  expr: absent_over_time(forecasts_published_counter{job="metrics_job", module_name="forecaster"}[5m])

If your metric is missing for 5 minutes, absent_over_time returns a pseudo-metric with your labels, and the alert will fire.

Once your metric appears again, absent_over_time returns nothing, and the alert will be resolved.

Since in this case there is no for: clause in the alert rule, there will be no pending state. But the alert will not fire exactly 5 minutes after the metric goes missing; it will fire after 5 minutes plus some additional delay (depending on the evaluation_interval).

If you want the firing time closer to 5 minutes, you could set the range for absent_over_time to something like 5m - evaluation_interval/2 - scrape_interval/2 (calculated manually from your config); in your case, I believe that's 4m15s, as sketched below.
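With your configuration (evaluation_interval: 1m, scrape_interval: 30s) the arithmetic is 5m - 30s - 15s = 4m15s, so a sketch of the adjusted rule (the annotation text is my own assumption) would be:

- alert: forecaster
  # 5m - evaluation_interval/2 - scrape_interval/2 = 5m - 30s - 15s = 4m15s
  expr: absent_over_time(forecasts_published_counter{job="metrics_job", module_name="forecaster"}[4m15s])
  annotations:
    description: "No forecasts_published_counter samples received for roughly 5 minutes"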

Upvotes: 0
