David Michael Gang

Reputation: 7299

Why is the moving average higher than the actual series in Prometheus?

Given a gauge called gin_in_flight_requests, we have two queries in Prometheus:

green line:

sum(avg_over_time(gin_in_flight_requests{app="my-service",cluster="prod", url="/api/v1/url1"}[1m]))

yellow line:

sum(gin_in_flight_requests{app="my-service",cluster="prod", url="/api/v1/url1"})

At 14:35 the green line has a higher peak than every individual point of the sum line. How can the sum of averages over time produce a higher result than the maximum of the sum itself?

sum of average over time vs plain sum

The graph was made with Grafana 9 Explore.

Upvotes: 1

Views: 917

Answers (2)

Sascha Doerdelmann

Reputation: 846

The yellow line shows query evaluation points, not the raw samples in the database.

See Average value not calculated right? and How does Prometheus DB calculate average value.

A query can get you higher or lower values than the original data.

Example

Raw data:

  • data point value 1 at second 0
  • data point value 1 at second 10
  • data point value 10 at second 20
  • data point value 2 at second 30

Panel showing points every 15s, beginning at second 1:

  • retrieved value at second 1 is 1
  • retrieved value at second 16 is 1
  • retrieved value at second 31 is 2

But the average value between second 16 and second 31 is 6.
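The arithmetic in this example can be checked with a short Python sketch. It is only a simplified model, not Prometheus code: the helper names `last_sample` and `window_avg` are made up for illustration, the 300-second lookback is assumed to match the Prometheus default, and windows are assumed to be open on the left and closed on the right.

```python
# Raw samples from the example: (timestamp in seconds, value),
# assumed to be sorted by timestamp.
RAW = [(0, 1), (10, 1), (20, 10), (30, 2)]

def last_sample(samples, t, lookback=300):
    """Value of the newest sample with t - lookback < ts <= t, or None.

    Hypothetical helper that models what a plain selector shows
    at evaluation time t.
    """
    window = [v for ts, v in samples if t - lookback < ts <= t]
    return window[-1] if window else None

def window_avg(samples, start, end):
    """Mean of all samples with start < ts <= end, or None (hypothetical helper)."""
    window = [v for ts, v in samples if start < ts <= end]
    return sum(window) / len(window) if window else None

# Panel points every 15s, beginning at second 1.
points = [last_sample(RAW, t) for t in (1, 16, 31)]

# Average of the raw samples between second 16 and second 31.
avg_16_31 = window_avg(RAW, 16, 31)

print(points, avg_16_31)
```

In this model the spike of 10 at second 20 never appears among the panel points, yet it still contributes to the average, which is how an averaged line can rise above every point of the plain line.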

The effect will increase if you zoom out by choosing greater time ranges in Grafana.

Upvotes: 1

valyala

Reputation: 17890

By default, Prometheus effectively treats a time series selector as if it were wrapped in the last_over_time() rollup function with a 5-minute lookbehind window in square brackets (the default lookback delta), if the selector isn't already wrapped in a rollup function. So the sum(gin_in_flight_requests{app="my-service",cluster="prod", url="/api/v1/url1"}) query is evaluated as if it were the following query:

sum(
  last_over_time(
    gin_in_flight_requests{app="my-service",cluster="prod", url="/api/v1/url1"}[5m]
  )
)

See these docs for more details.

In other words, this query takes into account only a subset of the raw samples: the last raw sample just before each point displayed on the graph. It ignores the remaining raw samples, so it may return values smaller than the sum(avg_over_time(...)) query. If you want to take the maximum raw samples into account, use the max_over_time function.

P.S. If you want to capture all the raw sample maximums and minimums on the selected time range in Grafana, use max_over_time() and min_over_time() queries with the $__interval lookbehind window in square brackets:

sum(max_over_time(...[$__interval]))

and

sum(min_over_time(...[$__interval]))
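To see why the max_over_time() and min_over_time() variants catch spikes that the plain query misses, here is a rough Python model for a single series. The sample data, the `rollup` helper, the evaluation times and the 15-second step standing in for $__interval are all assumptions chosen for illustration; real Prometheus evaluation has more details (staleness handling, step alignment) that this sketch ignores.

```python
# Hypothetical raw samples for one series: (timestamp in seconds, value),
# sorted by timestamp.
samples = [(0, 1), (10, 1), (20, 10), (30, 2)]

def rollup(samples, t, window, fn):
    """Apply fn to the values of all samples with t - window < ts <= t."""
    values = [v for ts, v in samples if t - window < ts <= t]
    return fn(values) if values else None

step = 15                 # stands in for Grafana's $__interval
eval_times = [1, 16, 31]  # one evaluation per graph point

# Plain selector: newest sample within an assumed 5-minute lookback.
last_points = [rollup(samples, t, 300, lambda vs: vs[-1]) for t in eval_times]

# max_over_time / min_over_time over one step per graph point.
max_points = [rollup(samples, t, step, max) for t in eval_times]
min_points = [rollup(samples, t, step, min) for t in eval_times]

print(last_points, max_points, min_points)
```

Because each window spans exactly one step, every raw sample falls into some window, so the spike of 10 shows up in `max_points` even though the plain query skips it. An outer sum() would simply add these per-series results together.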

P.P.S. FYI, an alternative Prometheus-like monitoring solution I work on, VictoriaMetrics, provides a rollup function that returns min, max and avg values simultaneously on the selected time range. That is, it can be used instead of three separate queries with the min_over_time(), max_over_time() and avg_over_time() functions.

Upvotes: 4
