Max

Reputation: 57

Complex rules/filters for Prometheus-Alertmanager Alerts

Situation: I have Prometheus and Alertmanager set up to monitor, among other things, the CPU temp of various devices. Alertmanager sends alerts from production devices to PagerDuty.

The devices I'm monitoring come in different models with different operating specs. Normal CPU temp for models 1-5 is 50C, while for model 6 it's 70C. The CPU temp alert threshold is currently 60C for every model, so PagerDuty keeps getting alerts from model 6 devices that are operating at their normal temperature.

Is there a way to filter out CPU temp alerts from model 6 devices only while their temp is below 80C, and still get CPU temp alerts for model 1-5 devices at 60C?

Note: There are lots of other metrics that are being monitored, but for all of them other than CPU temp, all device models have the exact same thresholds.

Here is a snippet from my alertmanager.yml that sends prod alerts to PagerDuty:

- match:
    stack_name: prod
    severity: critical
  receiver: PagerDuty

Admittedly, I don't have a great deal of YAML experience. This is what I'm hoping to do, but I'm not sure of the correct syntax:

- match:
    stack_name: prod
    severity: critical
    alertname: !device_cpu_temperature
  receiver: PagerDuty
- match:
    stack_name: prod
    severity: critical
    alertname: device_cpu_temperature
    uuid: !*6X*
  receiver: PagerDuty
- match: 
    stack_name: prod
    severity: critical
    alertname: device_cpu_temperature
    uuid: *6X*
    value: >80
  receiver: PagerDuty

Desired outcome:

Or would it be better to have 2 different alert rules in prometheus? Can certain rules be applied to only certain devices? If so, how?

Upvotes: 5

Views: 10451

Answers (1)

Ignacio Millán

Reputation: 8026

The easiest approach is to create separate alert rules in Prometheus.

Alertmanager is only meant to send, group, filter, and otherwise route alerts, not to evaluate metrics.
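To illustrate the point, here is a rough sketch of what Alertmanager routing can express: it matches on alert labels, by equality or regex, and has no way to compare the metric value. The `uuid` label and the `6X` pattern are taken from the question; the `model6-review` receiver is a hypothetical name, and the `matchers` key assumes a reasonably recent Alertmanager (older configs use `match` / `match_re` as in the question).

```yaml
route:
  receiver: PagerDuty
  routes:
    # Model 6 CPU temperature alerts, selected purely by labels.
    # Alertmanager regexes are anchored, hence the leading and trailing ".*".
    - matchers:
        - stack_name = "prod"
        - alertname = "device_cpu_temperature"
        - uuid =~ ".*6X.*"
      receiver: model6-review   # hypothetical receiver name
    # Everything else that is prod and critical still goes to PagerDuty.
    - matchers:
        - stack_name = "prod"
        - severity = "critical"
      receiver: PagerDuty
```

Note that nothing here can say "only if the value is above 80"; that comparison has to happen in the Prometheus rule expression.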

You can achieve this with two alert rules in the Prometheus configuration, each selecting its devices by hostname or by any other label the exporter provides.

The expression for servers 1-5 should be something like this:

 - alert: ServiceProbeFailed
   expr: cpu_temperature{hostname!~".*server_6.*"} > 50

And the rule for server 6:

 - alert: ServiceProbeFailed
   expr: cpu_temperature{hostname=~".*server_6.*"} > 70

Both rules use the same alert name, so Alertmanager routes alerts from either rule in the same way.
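Put together, a complete rule file might look roughly like the sketch below. It uses the thresholds from the question (60C for models 1-5, 80C for model 6) and the question's alert name. The metric name `cpu_temperature` and the `hostname` label come from the expressions above; the group name, the `for: 5m` duration, and the annotation text are assumptions, and the `stack_name` label is assumed to already be present on the series.

```yaml
groups:
  - name: device_cpu_temperature   # hypothetical group name
    rules:
      # Models 1-5: every host that does NOT match the model 6 pattern.
      - alert: device_cpu_temperature
        expr: cpu_temperature{hostname!~".*server_6.*"} > 60
        for: 5m                    # assumed; avoids paging on brief spikes
        labels:
          severity: critical
        annotations:
          summary: "CPU temperature above 60C on {{ $labels.hostname }}"
      # Model 6: same alert name, higher threshold.
      - alert: device_cpu_temperature
        expr: cpu_temperature{hostname=~".*server_6.*"} > 80
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "CPU temperature above 80C on {{ $labels.hostname }}"
```

With rules like these, the existing `stack_name: prod` / `severity: critical` route in alertmanager.yml should not need any changes, since the model-specific logic lives entirely in the `expr` lines.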

Upvotes: 9
