JayBe

Reputation: 31

Prometheus - combining two alerts with "and" and vector matching

We have several alerts and want to combine them into one big alert covering CPU, memory and disk I/O.

For example:

rules:
  - alert: OutOfMemory
    annotations:
      description: "Node memory is filling up (< 5% left)\n VALUE = {{ $value }}"
      summary: Out of memory (instance {{ $labels.instance }})
    expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 5
    for: 5m
    labels:
      severity: warning

and

  - alert: HighCpuLoad
    annotations:
      description: "CPU load is > 80%\n VALUE = {{ $value }}"
      summary: High CPU load (instance {{ $labels.instance }})
    expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
    for: 5m
    labels:
      severity: warning

We can't figure out what those alerts would look like when combined with the "and" operator plus vector matching. Can someone help us out here?

Best regards

Upvotes: 0

Views: 2834

Answers (1)

Michael Doubez

Reputation: 6863

You would have to use a vector matching keyword (on or ignoring). In brief, and in simple cases such as yours, it specifies which labels must match on both sides of the operator.

In the case of the node exporter it would be:

(<OutOfMemory expression>) AND ON(instance) (<HighCpuLoad expression>)
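
With the two expressions from the question substituted in, the combined rule might look like the sketch below. It is untested; the alert name NodeOutOfMemoryAndHighCpu and the annotation texts are made up for illustration. Note that "and" keeps the labels and values of the left-hand side, so {{ $value }} here is the available-memory percentage, not the CPU load.

```yaml
rules:
  # Hypothetical alert name; fires only when both conditions hold for the same instance.
  - alert: NodeOutOfMemoryAndHighCpu
    expr: |
      (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 5)
      and on(instance)
      (100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80)
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Out of memory and high CPU load (instance {{ $labels.instance }})
      # The value comes from the left-hand (memory) expression.
      description: "Memory available (%) = {{ $value }}"
```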

From a usability point of view, I would rather keep multiple alerts which are not sent to your alerting system (route them to a black hole receiver in Alertmanager) and then use the ALERTS metric to trigger your big alert. This will give you:

  • simpler and richer expressions (you could trigger the alert if 4 out of 5 alerts are firing, or add an OR clause)
  • when looking at a dashboard, you will know exactly which issues are firing
  • different "for" statements - you may not want the same "for" duration for high CPU and for out-of-memory.
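
The "black hole" mentioned above could be set up along these lines in the Alertmanager configuration. This is a minimal sketch, not a tested setup: the receiver names default and blackhole are hypothetical, and the matchers syntax assumes Alertmanager 0.22 or later. A receiver with no notification integrations configured simply discards whatever is routed to it.

```yaml
route:
  # Hypothetical default receiver; the combined alert ends up here.
  receiver: default
  routes:
    # The component alerts are routed to a receiver that notifies nobody.
    - matchers:
        - alertname =~ "OutOfMemory|HighCpuLoad"
      receiver: blackhole

receivers:
  - name: default
    # Real notification settings (email, Slack, ...) would go here.
  - name: blackhole
    # Intentionally empty: alerts routed here are dropped.
```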

I have not tested it, but the combined rule would look like the following:

rules:
  - alert: NodeInTrouble
    expr: sum(ALERTS{alertname=~"OutOfMemory|HighCpuLoad", alertstate="firing"}) BY (instance) == 2
    for: 1m
    labels:
      severity: warning
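
The "N out of M" idea can be sketched the same way by relaxing the comparison. This is equally untested: HighDiskIO is a hypothetical third alert (the question mentions disk I/O but shows no rule for it), and NodeInTroubleAny2 is an invented name. The alertstate="firing" matcher keeps pending alerts from being counted.

```yaml
rules:
  # Hypothetical rule: fires when at least 2 of the 3 listed alerts are firing on an instance.
  - alert: NodeInTroubleAny2
    expr: count by (instance) (ALERTS{alertname=~"OutOfMemory|HighCpuLoad|HighDiskIO", alertstate="firing"}) >= 2
    for: 1m
    labels:
      severity: warning
```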

Upvotes: 1
