Disbalance

Reputation: 121

Prometheus count specific value occurrences

Can someone please help me count how often a gauge has a specific value?

I have one metric whose device label varies, and its value can only be 1 or 0, where 1 = error and 0 = OK.

I need to calculate the number of times when metrics were 1 by range variable provided in Grafana.

I tried sum, sum_over_time, and count. Only sum_over_time gives the correct result, but for some reason it shows the same value multiple times:

sum_over_time(aqa_device_health_checker{env="dev", device="FOO"}[1d])

Upvotes: 4

Views: 12018

Answers (2)

valyala

Reputation: 17784

I need to calculate the number of times when metrics were 1 by range variable provided in Grafana.

The following query should return the number of times the time series matching the aqa_device_health_checker{env="dev", device="FOO"} series selector had the value 1 over the time range selected in Grafana (aka $__range):

last_over_time(
  sum_over_time(
    aqa_device_health_checker{env="dev", device="FOO"}[$__range] offset -$__range
  )[$__range:$__range]
)

The query returns a separate result for each matching time series. If you need a single total across all matching time series, wrap the query above in sum():

sum(
  last_over_time(
    sum_over_time(
      aqa_device_health_checker{env="dev", device="FOO"}[$__range] offset -$__range
    )[$__range:$__range]
  )
)

Note that both queries above count how many times the metric had the value 1 only if the metric can take just the values 0 and 1, because they sum the raw sample values. If the metric can have other values, these queries won't work as expected. Unfortunately, Prometheus doesn't provide easy-to-use functionality for counting the number of raw samples with a pre-defined value N. If you know the interval between samples (aka scrape_interval) beforehand, the following hack based on a Prometheus subquery can be used:

count_over_time(
  (
    last_over_time(m[scrape_interval]) == N
  )[$__range:scrape_interval]
)

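This query counts the number of raw samples with values equal to N over the time range $__range selected in Grafana. Here m stands for the series selector, and scrape_interval must be replaced with the actual interval between samples.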
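To see why summing only works for a 0/1 gauge while an equality count works for any value, here is a small Python sketch over made-up in-memory samples. It does not query Prometheus; the sample lists and the function names sum_samples and count_equal are hypothetical and exist only for illustration.

```python
# Hypothetical raw sample values of one series, oldest first (not real data).
binary_samples = [0, 1, 1, 1, 1, 0]
multi_samples = [0, 2, 2, 1, 0, 2]


def sum_samples(samples):
    """Adds up all sample values, which is what sum_over_time does for a range."""
    return sum(samples)


def count_equal(samples, n):
    """Counts the samples whose value equals n (the goal of the == N subquery)."""
    return sum(1 for value in samples if value == n)


# For a 0/1 gauge the sum and the count of ones agree.
print(sum_samples(binary_samples), count_equal(binary_samples, 1))
# Once other values appear, the sum no longer counts occurrences.
print(sum_samples(multi_samples), count_equal(multi_samples, 2))
```

With the second list the sum mixes the values 1 and 2 together, so only the equality count tells how many samples had the value 2.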

If the interval between samples isn't known beforehand, then it is impossible to calculate the number of samples with a particular value in Prometheus. If you still need this functionality, take a look at the count_eq_over_time() function provided by VictoriaMetrics, a Prometheus-like monitoring solution I work on. For example, the following query returns the exact number of samples with the value 10 over the last hour for time series m:

count_eq_over_time(m[1h], 10)

Upvotes: 1

anemyte

Reputation: 20176

I came up with two solutions; choose whichever suits you best. To keep things simple, let's assume Prometheus scrapes at a 15-second interval and the error state lasted for 1 minute. The gathered data would then look like this:

state_metric 0 @t
state_metric 1 @t+15s
state_metric 1 @t+30s
state_metric 1 @t+45s
state_metric 1 @t+60s
state_metric 0 @t+75s

With changes()

This shows how many times the metric entered the error state. The query below returns 1 for the example data above, and it only gives adequate results if the gauge can hold exactly two possible values (for example 1 and 0).

changes(state_metric[1d])/2

changes() shows how many times the metric value changed during the interval, and the division by 2 compensates for the change back to normal. The downside is that this counts transitions into the error state rather than the samples where the metric was 1, so it is only useful for detecting quick changes of state. But you probably have an alert for when the error state persists for some time, so I don't think this is really a problem.
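The arithmetic can be checked by hand on the example samples. The sketch below is plain Python, not PromQL; the helper named changes is a hypothetical stand-in that only mimics the counting rule described above.

```python
# The example samples from above, scraped every 15 seconds (oldest first).
samples = [0, 1, 1, 1, 1, 0]


def changes(values):
    """Counts how often a value differs from the sample right before it."""
    return sum(1 for prev, cur in zip(values, values[1:]) if prev != cur)


state_changes = changes(samples)   # counts 0 -> 1 and 1 -> 0
error_periods = state_changes / 2  # one entry into the error state
print(state_changes, error_periods)
```

Note that a window ending while the metric is still 1 sees an odd number of changes, so the division yields a fraction (for example, the samples 0, 1, 1 give 0.5).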

With a subquery

This is more precisely what you asked for:

the number of times when metrics were 1

But there is a catch: with the example data above, the query below returns 4:

sum_over_time(count(state_metric == 1)[1d:])

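[1d:] means the instant query (count(state_metric == 1)) is repeated at each evaluation step over the last 1d. With no resolution given after the colon, the step defaults to the global evaluation interval, so this matches one evaluation per data point only when that interval equals the scrape interval. The result is precisely the number of times state_metric was 1, which can be useful, for example, to calculate the downtime (just multiply by the scrape interval). Unlike the first method, this works with any number of possible states, since you can put whatever you need in the condition.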
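As a rough check of the downtime calculation on the example samples, here is a Python sketch. It assumes one evaluation per scraped sample and the 15-second scrape interval from the example; the variable names are made up for illustration.

```python
SCRAPE_INTERVAL_SECONDS = 15  # assumed, as in the example above
samples = [0, 1, 1, 1, 1, 0]  # example samples, oldest first

# With one evaluation per sample, count(state_metric == 1) yields 1 when the
# sample is 1 and nothing otherwise, so summing gives the number of ones.
error_samples = sum(1 for value in samples if value == 1)

# Multiplying by the scrape interval turns the sample count into downtime.
downtime_seconds = error_samples * SCRAPE_INTERVAL_SECONDS
print(error_samples, downtime_seconds)
```

The 60 seconds obtained this way match the 1-minute error state assumed at the start of this answer.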

Upvotes: 3
