Reputation: 1686

PromQL: What is rate() function meant for?

I have a question regarding PromQL and its query functions rate() and how to use it properly. In my application, I have a thread running, and I use Micrometer's Timer to monitor the thread's runtime. Using Timer gives you a counter with suffix _count and another counter with the sum of the seconds spent with suffix _sum. E.g. my_metric_sum and my_metric_count.

My raw data looks like this (scrape interval 30 s, range vector 5m):

Now according to the docs (https://prometheus.io/docs/prometheus/latest/querying/functions/#rate), rate() calculates the per-second average rate of increase of the time series in the range vector (which is 5m here).

Now my question is: why would I want that? The relative change of my execution runtime seems pretty useless to me. In fact, just using sum/count looks more useful as it gives me the avg absolute duration for each moment in time. At the same time, and this is what confused me, in the docs I find

To calculate the average request duration during the last 5 minutes from a histogram or summary called http_request_duration_seconds, use the following expression:

rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])

Source: https://prometheus.io/docs/practices/histograms/

But as I understand the docs, it looks like this expression would calculate the per-second average rate of increase of the request duration, i.e. not how long a request takes on average, but instead how much the request duration has changed on average in the last 5 minutes.

Upvotes: 11

Answers (3)

valyala

Reputation: 18010

The rate(m[d]) function calculates the increase of a counter metric m over the given lookbehind window d in square brackets and then divides the increase by d, expressed in seconds. The calculation is performed independently for each time series matching m. For example, suppose there are http_requests_total metrics with a url label:

http_requests_total{url="/foo"}
http_requests_total{url="/bar"}

If they have the following values at time t0:

http_requests_total{url="/foo"} 123
http_requests_total{url="/bar"} 456

... and the following values at time t0 + 5 minutes:

http_requests_total{url="/foo"} 345
http_requests_total{url="/bar"} 789

Then rate(http_requests_total[5m]) at time t0 + 5 minutes is calculated in two steps:

1. Calculate the increase of each series between t0 and t0 + 5 minutes:

increase(http_requests_total{url="/foo"}[5m]) = 345 - 123 = 222
increase(http_requests_total{url="/bar"}[5m]) = 789 - 456 = 333

2. Divide each increase by 5 minutes expressed in seconds (5*60s = 300s):

rate(http_requests_total{url="/foo"}[5m]) = 222 / 300 = 0.74
rate(http_requests_total{url="/bar"}[5m]) = 333 / 300 = 1.11

So the end result of rate(http_requests_total[5m]) is the average number of requests per second (rps) over the last 5 minutes, calculated individually for each time series named http_requests_total.
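The arithmetic above can be reproduced with a small Python sketch. It is a simplification, not how Prometheus is implemented: it only uses the first and last sample in the window and ignores counter resets and any boundary extrapolation. The function names simple_increase and simple_rate are made up for illustration.

```python
# Simplified model of rate(): increase over the window divided by the
# window length in seconds. Ignores counter resets and extrapolation.

def simple_increase(first_value, last_value):
    """Increase of a counter between the first and last sample in the window."""
    return last_value - first_value


def simple_rate(first_value, last_value, window_seconds):
    """Per-second average rate of increase over the window."""
    return simple_increase(first_value, last_value) / window_seconds


window_seconds = 5 * 60  # the [5m] lookbehind window

# (value at t0, value at t0 + 5 minutes) per series, taken from the example above
samples = {"/foo": (123, 345), "/bar": (456, 789)}

rates = {
    url: simple_rate(first, last, window_seconds)
    for url, (first, last) in samples.items()
}

for url, r in rates.items():
    print(f'rate(http_requests_total{{url="{url}"}}[5m]) = {r:.2f}')
```

Each series is handled on its own, which mirrors the per-series behaviour described above.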

A few notes:

Both rate() and increase() properly handle e.g. counter resets, when the counter is reset to zero.
Sometimes Prometheus can return unexpected results from rate() and increase() because of the chosen data model. See this issue. This issue is addressed in VictoriaMetrics - Prometheus-like monitoring system I work on - see this comment and this article.
Some PromQL-compatible query engines such as MetricsQL allow skipping the lookbehind window in square brackets when using rate() function, so rate(http_requests_total) is a valid MetricsQL query. In this case it automatically adds [$__interval] lookbehind window before query execution. See these docs for more details.
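To illustrate the first note about counter resets, here is a rough Python sketch of the general idea: any drop in the counter value is treated as a restart from zero, so the increase is accumulated from consecutive deltas rather than computed as last minus first. This is an assumption-laden simplification (the helper name increase_with_resets and the sample values are invented, and real engines apply further adjustments), not the actual Prometheus code.

```python
def increase_with_resets(values):
    """Sum the increase across consecutive samples of a counter.

    A sample lower than its predecessor is assumed to mean the counter
    was reset to zero and then grew to the current value.
    """
    total = 0.0
    for prev, cur in zip(values, values[1:]):
        if cur >= prev:
            total += cur - prev
        else:
            # Counter reset: assume it restarted at 0 and climbed to `cur`.
            total += cur
    return total


# Hypothetical samples within one window; the process restarts after 160.
values = [100, 130, 160, 10, 40]

naive = values[-1] - values[0]            # negative, clearly wrong for a counter
adjusted = increase_with_resets(values)   # reset-aware increase

print("naive increase:", naive)
print("reset-aware increase:", adjusted)
```

Taking only the difference between the last and first sample would report a negative increase here, which is why reset handling matters for counters.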

Upvotes: 20

Amir Mehler

Reputation: 4680

First of all - use the tool that matches your use case.

Second - whatever you choose, validate the data. And better do it now than during an outage or with an angry customer/user.

Third - _count and _sum are features of histograms and summaries (histograms additionally expose _bucket). The scrape interval doesn't really matter here, as long as it's shorter than the [5m] window of the rate() function.

The rate simply gives you data points of "how many occurrences happened per second, on average, during these five minutes" ([5m]).

General note - the rate() concept in Prometheus is causing a lot of confusion. It's debated between too many people. They should have probably called it something else.

Upvotes: 1

sskrlj

Reputation: 309

While I am not familiar with Micrometer's Timer, the metric you're describing is of type Summary. It counts the "events" in _count and sums the events' magnitude (duration, elapsed time and similar) in _sum. If you now perform rate(metric_count[5m]), you'll get the per-second rate of your events, averaged over 5m. And if you want to know the average duration of these events within a 5m window, you do rate(metric_sum[5m]) / rate(metric_count[5m]). If you instead divide metric_sum / metric_count, you'll get the all-time (since the last counter reset) average rather than the 5m average at some point in time. In a way, it looks a bit funny to use rate() for this. Using increase() seems more intuitive to me, but mathematically it's exactly the same: rate() is just increase() / range, so the ranges cancel each other out in rate(metric_sum[5m]) / rate(metric_count[5m]).
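A small Python sketch may make the difference concrete. The numbers are invented for illustration, and the calculation is the same simplified "last minus first" model of rate() and increase(), without reset handling or extrapolation.

```python
window_seconds = 300  # the [5m] window

# Hypothetical counter values at the start and end of the window:
# total seconds spent (metric_sum) and number of events (metric_count).
sum_start, count_start = 1000.0, 500
sum_end, count_end = 1120.0, 530

# rate(metric_sum[5m]) and rate(metric_count[5m]) in the simplified model
rate_sum = (sum_end - sum_start) / window_seconds
rate_count = (count_end - count_start) / window_seconds

# Average duration of the events that happened inside the window
avg_window = rate_sum / rate_count

# The same value via increase(): the window length cancels out
avg_via_increase = (sum_end - sum_start) / (count_end - count_start)

# metric_sum / metric_count: average over everything since the counters started
avg_all_time = sum_end / count_end

print(f"5m average duration:       {avg_window:.3f} s")
print(f"via increase():            {avg_via_increase:.3f} s")
print(f"all-time average duration: {avg_all_time:.3f} s")
```

With these made-up values the events in the last five minutes are slower than the long-run average, which the plain sum/count ratio hides.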

Upvotes: 2
