Jay Xue

Reputation: 79

How to detect a new metric with a Prometheus alerting rule

Say I have a metric request_failures for users. For each user I add a unique label value to the metric. So for user u1, when a request has failed twice, I get the following:

    request_failures{user_name="u1"} 2

I also have a rule that fires when there are new failures. Its expression is:

    increase(request_failures[1m]) > 0

This works well for a user that already encountered failures. For example, when u1 encounters the third failure, the rule fires.

When a request fails for a new user u2, I get these metrics:

    request_failures{user_name="u1"} 2
    request_failures{user_name="u2"} 1

Now the problem is that the alert rule doesn't fire for u2. It seems that the rule cannot recognize a "new metric", although all of these series belong to the same metric, request_failures, just with different labels.

Can anyone point out how I should construct the rule?

Upvotes: 8

Views: 5635

Answers (3)

anemyte

Reputation: 20176

As @MichaelDoubez already pointed out, increase() does not treat a newly created metric as a value increase. Unfortunately, the same goes for changes(). There are reasons for that, such as a missed scrape, but it can still be solved with a query:

    increase(request_failures[10m]) > 0
    or
    ( request_failures unless request_failures offset 10m )

The second part (beginning with or) will fire for 10 minutes (defined by the offset) when there is a new metric.
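
For reference, here is a minimal sketch of how the expression above might be wrapped in a rule file. The group name, alert name, severity label, and summary text are hypothetical placeholders, not something from the question; only the expr comes from this answer.

```yaml
groups:
  - name: request-failures            # hypothetical group name
    rules:
      - alert: NewRequestFailures     # hypothetical alert name
        # First branch: existing series whose counter grew in the last 10m.
        # Second branch: series that exist now but did not exist 10m ago.
        expr: |
          increase(request_failures[10m]) > 0
          or
          (request_failures unless request_failures offset 10m)
        labels:
          severity: warning           # assumed label, adjust to your routing
        annotations:
          summary: 'New request failures for user {{ $labels.user_name }}'
```

A file like this can be syntax-checked with promtool check rules before loading it. Keeping the range and the offset at the same duration means both branches cover the same 10-minute window.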

Upvotes: 5

Michael Doubez

Reputation: 6863

The reason the rule doesn't fire is that the increase() function doesn't treat a newly created counter as having been 0 before its first scrape. I didn't find any source on that, but it seems to be the case.

Therefore you want to detect two cases:

  • a user has an issue where there was none before
  • a user has a new issue in the last N minutes

This can be rephrased in the opposite logic:

An alert should be triggered for a user with errors unless there was no increase in errors in the last N minutes for this user.

Which readily translates into the following PromQL rule expression:

    request_failures > 0 unless increase(request_failures[1m]) == 0

In hindsight, regarding the increase() function: it cannot assume the previous value was 0, because it is evaluated over a range. The previous value may lie outside the range and need not be 0. So it makes sense that it requires at least two points to produce a value.
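
To make the three cases concrete, here is a rough sketch of the unless expression from this answer as an alerting rule, with the evaluation spelled out in comments. The group name, alert name, and summary are hypothetical. It also assumes the scrape interval is short enough that an existing series has at least two samples per minute; otherwise every series would look new.

```yaml
groups:
  - name: request-failures-unless     # hypothetical group name
    rules:
      - alert: UserRequestFailures    # hypothetical alert name
        # Left side: every user whose counter is above 0.
        # Right side: users whose counter did not move in the last minute;
        # `unless` drops those users from the left side.
        #   u1, no new failure:  increase == 0 matches, so u1 is dropped.
        #   u1, new failure:     increase > 0, no match, so u1 fires.
        #   u2, brand-new series: only one sample in the window, so
        #                         increase() returns nothing and u2 fires.
        expr: request_failures > 0 unless increase(request_failures[1m]) == 0
        annotations:
          summary: 'Request failures for user {{ $labels.user_name }}'
```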

Upvotes: 3

Jay Xue

Reputation: 79

This should be the answer: https://www.robustperception.io/dont-put-the-value-in-alert-labels.

The key point is that a label should not hold variable values, because labels are part of a metric's identity. The solution is to add the username as an annotation instead of as a label on the metric.

Upvotes: -1
