Reputation: 1686
I have a question regarding the PromQL query function rate() and how to use it properly. In my application, I have a thread running, and I use Micrometer's Timer to monitor the thread's runtime. A Timer gives you a counter with the suffix _count and another counter, with the suffix _sum, that holds the sum of the seconds spent, e.g. my_metric_count and my_metric_sum.
My raw data looks like this (scrape interval 30 s, range vector 5m):
Now according to the docs (https://prometheus.io/docs/prometheus/latest/querying/functions/#rate), rate() calculates the per-second average rate of increase of the time series in the range vector (which is 5m here).
Now my question is: why would I want that? The relative change of my execution runtime seems pretty useless to me. In fact, just using sum/count looks more useful as it gives me the avg absolute duration for each moment in time. At the same time, and this is what confused me, in the docs I find
To calculate the average request duration during the last 5 minutes from a histogram or summary called http_request_duration_seconds, use the following expression:
rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])
Source: https://prometheus.io/docs/practices/histograms/
But as I understand the docs, it looks like this expression would calculate the per-second average rate of increase of the request duration, i.e. not how long a request takes on average, but instead how much the request duration has changed on average in the last 5 minutes.
Upvotes: 11
Views: 26427
Reputation: 18010
The rate(m[d]) function calculates the increase of a counter metric m over the given lookbehind window d in square brackets and then divides the increase by d. The calculation is performed independently for each matching time series m. For example, suppose there are http_requests_total metrics with a url label:
http_requests_total{url="/foo"}
http_requests_total{url="/bar"}
If they have the following values at time t0:
http_requests_total{url="/foo"} 123
http_requests_total{url="/bar"} 456
... and the following values at time t0 + 5 minutes:
http_requests_total{url="/foo"} 345
http_requests_total{url="/bar"} 789
Then rate(http_requests_total[5m]) at time t0 + 5 minutes is calculated in the following way:
First, the increase between t0 and t0 + 5 minutes is calculated:
increase(http_requests_total{url="/foo"}[5m]) = 345 - 123 = 222
increase(http_requests_total{url="/bar"}[5m]) = 789 - 456 = 333
Then the increase is divided by 5 minutes expressed in seconds (5*60s = 300s):
rate(http_requests_total{url="/foo"}[5m]) = 222 / 300 = 0.74
rate(http_requests_total{url="/bar"}[5m]) = 333 / 300 = 1.11
So the end result of rate(http_requests_total[5m]) is the average rps (requests per second) over the last 5 minutes, calculated individually for each time series with the http_requests_total name.
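The arithmetic above can be reproduced with a small Python sketch. This is a deliberately simplified model (last value minus first value, divided by the window); it is not how Prometheus implements rate() internally, since the real function also handles counter resets and extrapolates towards the window boundaries. The names samples, simple_increase and simple_rate are made up for illustration.

```python
# Simplified model of rate(): (last - first) / window, computed per series.
# Assumption: exactly two samples per series, at t0 and t0 + 5 minutes.
WINDOW_SECONDS = 5 * 60

# (value at t0, value at t0 + 5 minutes), keyed by the url label
samples = {
    "/foo": (123, 345),
    "/bar": (456, 789),
}


def simple_increase(first, last):
    """Growth of the counter over the window (no reset handling)."""
    return last - first


def simple_rate(first, last, window_seconds=WINDOW_SECONDS):
    """Average per-second growth of the counter over the window."""
    return simple_increase(first, last) / window_seconds


# One result per series, just like rate() returns one value per time series.
rates = {url: simple_rate(first, last) for url, (first, last) in samples.items()}

for url, value in rates.items():
    print(f"{url}: {value:.2f} requests/s")
```

Each series is handled on its own, so /foo and /bar never influence each other's result.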
A few notes:
Both rate() and increase() properly handle counter resets, i.e. cases when the counter is reset to zero.
Sometimes Prometheus can return unexpected results from rate() and increase() because of the chosen data model. See this issue. This issue is addressed in VictoriaMetrics - Prometheus-like monitoring system I work on - see this comment and this article.
Some PromQL-compatible query engines such as MetricsQL allow skipping the lookbehind window in square brackets when using the rate() function, so rate(http_requests_total) is a valid MetricsQL query. In this case it automatically adds the [$__interval] lookbehind window before query execution. See these docs for more details.
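To make the note about counter resets more concrete, here is a rough Python sketch of the general idea: any drop in the value of a counter is treated as a restart from zero. It illustrates the principle only and is not the Prometheus or VictoriaMetrics implementation; the function name increase_with_resets and the sample values are invented for this example.

```python
def increase_with_resets(values):
    """Total increase over consecutive samples, treating any drop as a reset to zero."""
    total = 0.0
    for prev, cur in zip(values, values[1:]):
        if cur < prev:
            # The counter restarted: everything in `cur` was accumulated after the reset.
            total += cur
        else:
            total += cur - prev
    return total


# Hypothetical samples: the process restarts between 150 and 10.
series = [100, 120, 150, 10, 40]
print(increase_with_resets(series))
```

A naive last-minus-first calculation would give 40 - 100 = -60 for this series, which is why reset handling matters for counters.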
Upvotes: 20
Reputation: 4680
First of all - use the tool that matches your use case.
Second - whatever you choose, validate the data. And better do it now than during an outage or with an angry customer/user.
Third - _count and _sum are features of both histograms and summaries (_bucket exists only for histograms). The scrape interval doesn't really matter here, as long as it's shorter than the [5m] window of the rate() function.
The rate simply gives you data points of "how many occurrences happened per second, on average, during these five minutes" ([5m]).
General note - the rate() concept in Prometheus causes a lot of confusion and is debated among many people. They should probably have called it something else.
Upvotes: 1
Reputation: 309
While I am not familiar with Micrometer's Timer, the metric you're describing is of type Summary. It counts the "events" in _count and sums the events' magnitude, such as duration or elapsed time, in _sum.
If you now perform rate(metric_count[5m]), you'll get the 5m average per-second rate of your events. And if you want to know the average duration of these events within the 5m window, you do rate(metric_sum[5m]) / rate(metric_count[5m]). If you try dividing metric_sum/metric_count, you'll get the all-time average (since the last counter reset) at some point in time instead of the 5m average.
In a way, it looks a bit funny to use rate() for this. Using increase() seems more intuitive to me, but mathematically it's exactly the same, as rate() is just increase()/range, and so these ranges cancel each other out in rate(metric_sum[5m]) / rate(metric_count[5m]).
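A quick numeric sketch in Python may help to see both points: the window length cancels out of the ratio, and the windowed average can differ a lot from the all-time average. All numbers are invented for illustration, and rate is again simplified to (last - first) / window without reset handling or extrapolation.

```python
WINDOW_SECONDS = 5 * 60

# Hypothetical values of metric_sum and metric_count at the start and end of the window.
sum_start, count_start = 1000.0, 10000  # history: 10000 events taking 1000 s in total
sum_end, count_end = 1150.0, 10300      # last 5 minutes: 300 events taking 150 s

# Simplified rate(metric_sum[5m]) and rate(metric_count[5m])
rate_sum = (sum_end - sum_start) / WINDOW_SECONDS
rate_count = (count_end - count_start) / WINDOW_SECONDS

# rate(metric_sum[5m]) / rate(metric_count[5m]): average duration within the window
windowed_avg = rate_sum / rate_count

# increase(metric_sum[5m]) / increase(metric_count[5m]): same value, the window cancels out
increase_avg = (sum_end - sum_start) / (count_end - count_start)

# metric_sum / metric_count: average over the whole lifetime of the counters
all_time_avg = sum_end / count_end

print(f"windowed average: {windowed_avg:.3f} s")
print(f"increase-based:   {increase_avg:.3f} s")
print(f"all-time average: {all_time_avg:.3f} s")
```

In this made-up scenario the recent events are much slower than the historical ones, which the all-time average barely reflects while the windowed average shows it immediately.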
Upvotes: 2