Reputation: 57
Situation: I have Prometheus and Alertmanager set up to monitor, among other things, the CPU temp of various devices. Alertmanager sends alerts from production devices to PagerDuty.
The devices I'm monitoring have different models with different operating specs. Normal CPU temp for models 1-5 is 50C, while for model 6 it's 70C. Currently the threshold for the CPU temp alerts is 60C, so PagerDuty keeps getting alerts from model 6 devices that are operating at their normal temperature.
Is there a way to filter out CPU temp alerts from model 6 devices only when the temp is below 80C, and still get CPU temp alerts for model 1-5 devices at 60C?
Note: There are lots of other metrics that are being monitored, but for all of them other than CPU temp, all device models have the exact same thresholds.
Here is a snippet from my alertmanager.yml that sends prod alerts to PagerDuty:
- match:
    stack_name: prod
    severity: critical
  receiver: PagerDuty
Admittedly, I don't have a great deal of YAML experience, so I'm not sure of the correct syntax, but this is what I'm hoping to do:
- match:
    stack_name: prod
    severity: critical
    alertname: !device_cpu_temperature
  receiver: PagerDuty
- match:
    stack_name: prod
    severity: critical
    alertname: device_cpu_temperature
    uuid: !*6X*
  receiver: PagerDuty
- match:
    stack_name: prod
    severity: critical
    alertname: device_cpu_temperature
    uuid: *6X*
    value: >80
  receiver: PagerDuty
Desired outcome:
Or would it be better to have 2 different alert rules in prometheus? Can certain rules be applied to only certain devices? If so, how?
Upvotes: 5
Views: 10451
Reputation: 8026
The easiest approach is to create different alert rules in Prometheus.
Alertmanager is only meant to send, group, filter, etc. alerts, not to evaluate metrics.
You can achieve this with two different alerts in the Prometheus configuration, filtering by hostname or any other label provided by the exporter.
The expression for servers 1-5 should be something like this:
- alert: ServiceProbeFailed
  expr: cpu_temperature{hostname!~".*server_6.*"} > 50
And the rule for server 6:
- alert: ServiceProbeFailed
  expr: cpu_temperature{hostname=~".*server_6.*"} > 70
Both rules use the same alert name, so an Alertmanager route that matches on the alert name handles them in the same way.
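For reference, here is a rough sketch of how the two rules might look in a complete rules file, using the thresholds from the question (60C for models 1-5, 80C for model 6). It assumes the metric is called cpu_temperature, that model 6 devices can be recognized by "6X" appearing in a uuid label, and that the series already carry the stack_name label; the group name and the 5m "for" duration are placeholders, so adjust all of these to whatever your exporter actually provides.

```yaml
groups:
  - name: device_cpu_temperature   # hypothetical group name
    rules:
      # Models 1-5: every device whose uuid does NOT contain "6X".
      # Prometheus regex matchers are fully anchored, hence the leading/trailing ".*".
      - alert: device_cpu_temperature
        expr: cpu_temperature{uuid!~".*6X.*"} > 60
        for: 5m                    # assumed duration, tune as needed
        labels:
          severity: critical       # lets the existing PagerDuty route match

      # Model 6: only devices whose uuid contains "6X", with the higher threshold.
      - alert: device_cpu_temperature
        expr: cpu_temperature{uuid=~".*6X.*"} > 80
        for: 5m
        labels:
          severity: critical
```

With rules along these lines, the original alertmanager.yml route (stack_name: prod, severity: critical) can stay as it is, because the per-model thresholds are decided in Prometheus before the alert ever reaches Alertmanager.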
Upvotes: 9