vesna

Reputation: 335

Query prometheus counter across multiple instances

I have several instances exposing a Prometheus counter and would like to aggregate all of their values over a certain period of time. I've tried a lot of different things but can't get it working.

Let's assume my metric name is request_total. This metric has labels for path and status_code. My goal is to get an overall sum of this counter, without filtering by any of its labels. If I run sum by (instance) (request_total), I get the following graph from Prometheus:

enter image description here

As we can see, the counter seems to be correct for each instance. However, if I try to sum all those values with sum(request_total), I get the following result:

enter image description here

I'm pretty new to Prometheus, but I was expecting that the summed counter would not be reset and would instead be cumulative. Could you please help me and tell me what I am missing here?

Thanks in advance

Upvotes: 4

Views: 9915

Answers (2)

valyala

Reputation: 17830

It is normal for Prometheus counters to be reset periodically, for example when an instance restarts. If you need the total counter increase across multiple time series, with graceful handling of counter resets, wrap the increase() function in sum(). For example, the following query returns the total number of requests over the last year:

sum(increase(requests_total[1y]))

Note that this query needs to load and scan raw samples over a year-long time range ending at the current time, so it may be quite slow. You may adjust the lookbehind window in square brackets according to your needs. See these docs for possible time durations.
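To make the reset handling concrete, here is a minimal Python sketch of what sum(increase(...)) does conceptually. It is a simplification, not Prometheus's actual implementation: it ignores extrapolation at the window edges, and the function names, instance names, and sample values are hypothetical.

```python
from typing import Dict, List


def counter_increase(samples: List[float]) -> float:
    """Reset-aware increase of one counter series over a window.

    Simplified: no extrapolation at the window boundaries.
    """
    total = 0.0
    for prev, curr in zip(samples, samples[1:]):
        if curr >= prev:
            # Normal case: the counter grew (or stayed flat).
            total += curr - prev
        else:
            # The value dropped, so assume the counter was reset to zero
            # and count the whole new value as increase.
            total += curr
    return total


def total_increase(series: Dict[str, List[float]]) -> float:
    """Sum the reset-aware increase over all series, like sum(increase(...))."""
    return sum(counter_increase(samples) for samples in series.values())


# Hypothetical samples: instance-a restarts after reaching 4.
series = {
    "instance-a": [0, 2, 4, 0, 1],
    "instance-b": [5, 8, 11, 11, 12],
}
result = total_increase(series)
print(result)
```

The key point is that the increase is computed per series first and only then summed, so a restart of one instance does not pull the total down.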

Note also that Prometheus may return fractional results from increase() over time series with integer samples. This is due to extrapolation - see this issue for details. This issue has been solved in MetricsQL - see this article and this comment for technical details. MetricsQL also provides the running_sum function, which can be used for drawing the cumulative increase over the sum of counters. For instance, the following query returns a line that starts from 0 on the left side of the graph and rises over the selected duration according to the cumulative increase of the sum of all the requests_total series, i.e. it returns the cumulative number of requests over the selected time range:

running_sum(sum(increase(requests_total)))
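As a rough illustration of what a running sum does to the per-step increases, here is a small Python sketch. The per-step values are made up, and the helper only mimics the idea of running_sum, not the MetricsQL implementation.

```python
from itertools import accumulate
from typing import List


def running_sum(values: List[float]) -> List[float]:
    """Return the cumulative sum of the values, one output point per input point."""
    return list(accumulate(values))


# Hypothetical summed increase at each graph step.
per_step_increase = [0.0, 3.0, 5.0, 2.0]
cumulative = running_sum(per_step_increase)
print(cumulative)
```

Because every per-step increase is non-negative, the resulting line never goes down, which is the cumulative shape the question was expecting.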

Upvotes: 2

Nir Alfasi

Reputation: 53525

Yes, sum(request_total) should work and show the result across all the instances, and according to your graphs that's exactly what it does:

Until ~8:30am, two instances report 4 and 11 requests, for a total of 15, which you can see in the second graph.

From ~8:33am to 8:42am, only one instance reports one request. Then another instance starts reporting one request as well, which shows up as the step from 1 to 2 in the second, cumulative graph.
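The arithmetic above can be reproduced with a short Python sketch. The values 4, 11, 1 and 1 come from the graphs as described; the period labels are hypothetical, and the sketch assumes the two earlier series are no longer reporting in the later periods.

```python
from typing import List, Optional


def sum_across_instances(samples: List[Optional[float]]) -> float:
    """Add up the raw counter values that exist at one point in time.

    None marks a series that is not reporting at that moment.
    """
    return sum(v for v in samples if v is not None)


# Raw counter values per reporting instance at three points in time.
timeline = [
    ("until ~8:30am", [4, 11]),
    ("~8:33am to 8:42am", [1, None]),
    ("after 8:42am", [1, 1]),
]
totals = [sum_across_instances(values) for _, values in timeline]
print(totals)
```

A plain sum of raw counter values simply adds whatever each series reports at that moment, so it drops whenever the reporting counters start again from a low value.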

Upvotes: 3
