Shivakumar Sajjan

Reputation: 587

Is there a way to set a stability filter for alerts in Prometheus?

I have a service that pushes forecast metrics to Prometheus every minute through a push gateway, and I have configured an alert rule in Prometheus.

Requirements:

We need a stability filter because the service is sometimes unable to push metrics when the push gateway goes down for about a minute; since the push gateway recovers within 2 minutes, we do not want to send a firing alert in this scenario.

Prometheus configurations:

evaluation_interval: 1m

scrape_interval: 30s

Alert rule:

- alert: forecaster
  expr: rate(forecasts_published_counter{job="metrics_job", module_name="forecaster"}[5m]) <= 0
  for: 5m
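For completeness, this rule lives in a rules file loaded by Prometheus; a minimal sketch of the surrounding rule group (the group name here is just a placeholder) looks like:

groups:
  - name: forecaster-alerts   # placeholder group name
    rules:
      - alert: forecaster
        expr: rate(forecasts_published_counter{job="metrics_job", module_name="forecaster"}[5m]) <= 0
        for: 5m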

I experimented with a stability filter using the for: clause; it works for firing alerts, but it does not work for resolving alerts.

When the service does not publish forecasts for more than 5 minutes:

When the service is publishing forecasts within the 5-minute window:

I could change the evaluation interval to 5m, but that would affect other services, so I do not want to change it.

Is there any other way to set a stability filter (5m) in Prometheus for changing the alert state from firing to inactive (resolved)?

Upvotes: 1

Views: 852

Answers (2)

markalex

Reputation: 13432

Try

- alert: forecaster
  expr: time() - timestamp(forecasts_published_counter{job="metrics_job", module_name="forecaster"}) > 5 * 60
  annotations:
    description: "Forecast has not been updated for {{ $value }} seconds"

The logic from my other answer about scrape and evaluation timing applies here too.
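If you want firing to happen closer to 5 minutes with this expression as well, a sketch of the same rule with the threshold tightened by half an evaluation interval plus half a scrape interval (an assumption carried over from my other answer; 300 - 30 - 15 = 255 seconds, i.e. 4m15s with your configuration):

- alert: forecaster
  # 5*60 - evaluation_interval/2 - scrape_interval/2 = 300 - 30 - 15 = 255
  expr: time() - timestamp(forecasts_published_counter{job="metrics_job", module_name="forecaster"}) > 255
  annotations:
    description: "Forecast has not been updated for {{ $value }} seconds"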

Upvotes: 0

markalex

Reputation: 13432

You could use the absent_over_time function to check whether Prometheus has received the specified metric within the last 5 minutes.

- alert: forecaster
  expr: absent_over_time(forecasts_published_counter{job="metrics_job", module_name="forecaster"}[5m])

If your metric is missing for 5 minutes, absent_over_time returns a pseudo-metric with your labels, and the alert will fire.

Once your metric appears again, absent_over_time returns nothing, and the alert will be resolved.

Since in this case there is no for: clause in the alert rule, there will be no pending state. But the alert will not fire exactly 5 minutes after the metric goes missing; it will fire after 5 minutes plus some additional delay (depending on the evaluation_interval).

If you want the firing time closer to 5 minutes, you could set the range for absent_over_time to something like 5m - evaluation_interval/2 - scrape_interval/2 (calculated manually from your config); in your case, I believe that's 4m15s, as sketched below.
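With your configuration (evaluation_interval: 1m, scrape_interval: 30s) the arithmetic is 5m - 30s - 15s = 4m15s, so a sketch of the adjusted rule (the annotation text is my own assumption) would be:

- alert: forecaster
  # 5m - evaluation_interval/2 - scrape_interval/2 = 5m - 30s - 15s = 4m15s
  expr: absent_over_time(forecasts_published_counter{job="metrics_job", module_name="forecaster"}[4m15s])
  annotations:
    description: "No forecasts_published_counter samples received for roughly 5 minutes"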

Upvotes: 0
