krinklesaurus

Reputation: 1686

PromQL: What is rate() function meant for?

I have a question regarding PromQL and its query function rate() and how to use it properly. In my application, I have a thread running, and I use Micrometer's Timer to monitor the thread's runtime. Using Timer gives you a counter with the suffix _count and another counter, holding the sum of the seconds spent, with the suffix _sum, e.g. my_metric_count and my_metric_sum.

My raw data looks like this (scrape interval 30 s, range vector 5m):

[image of the raw data, not preserved]

Now according to the docs (https://prometheus.io/docs/prometheus/latest/querying/functions/#rate), rate() calculates the per-second average rate of increase of the time series in the range vector (which is 5m here).

Now my question is: why would I want that? The relative change of my execution runtime seems pretty useless to me. In fact, just using sum/count looks more useful as it gives me the avg absolute duration for each moment in time. At the same time, and this is what confused me, in the docs I find

To calculate the average request duration during the last 5 minutes from a histogram or summary called http_request_duration_seconds, use the following expression:

rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])

Source: https://prometheus.io/docs/practices/histograms/

But as I understand the docs, it looks like this expression would calculate the per-second average rate of increase of the request duration, i.e. not how long a request takes on average, but instead how much the request duration has changed on average in the last 5 minutes.

Upvotes: 11

Views: 26427

Answers (3)

valyala

Reputation: 18010

The rate(m[d]) function calculates the increase of a counter metric m over the given lookbehind window d in square brackets and then divides the increase by d, expressed in seconds. The calculation is performed independently for each matching time series m. For example, suppose there are http_requests_total metrics with a url label:

http_requests_total{url="/foo"}
http_requests_total{url="/bar"}

If they have the following values at time t0:

http_requests_total{url="/foo"} 123
http_requests_total{url="/bar"} 456

... and the following values at time t0 + 5 minutes:

http_requests_total{url="/foo"} 345
http_requests_total{url="/bar"} 789

Then rate(http_requests_total[5m]) at time t0 + 5 minutes is calculated in the following way:

  1. Calculate the increase of each series between t0 and t0 + 5 minutes:
increase(http_requests_total{url="/foo"}[5m]) = 345 - 123 = 222
increase(http_requests_total{url="/bar"}[5m]) = 789 - 456 = 333
  2. Divide the calculated increase by 5 minutes expressed in seconds (5*60s = 300s):
rate(http_requests_total{url="/foo"}[5m]) = 222 / 300 = 0.74
rate(http_requests_total{url="/bar"}[5m]) = 333 / 300 = 1.11

So the end result of rate(http_requests_total[5m]) is the average number of requests per second over the last 5 minutes, calculated individually for each time series named http_requests_total.
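
The arithmetic above can be reproduced with a minimal Python sketch. It assumes exactly two samples per series, taken at t0 and t0 + 5 minutes, and ignores counter resets and the extrapolation that Prometheus applies at the window boundaries. simple_increase and simple_rate are hypothetical helper names, not part of any Prometheus API.

```python
# Simplified model of increase() and rate() for a counter with two samples.
# Assumption: no counter resets and no extrapolation at window boundaries.

WINDOW_SECONDS = 5 * 60  # the [5m] lookbehind window


def simple_increase(first_value, last_value):
    """Growth of the counter between the first and last sample in the window."""
    return last_value - first_value


def simple_rate(first_value, last_value, window_seconds=WINDOW_SECONDS):
    """Average per-second growth of the counter over the window."""
    return simple_increase(first_value, last_value) / window_seconds


# Sample values from the example: (value at t0, value at t0 + 5 minutes)
samples = {
    "/foo": (123, 345),
    "/bar": (456, 789),
}

rates = {url: simple_rate(first, last) for url, (first, last) in samples.items()}

for url, value in rates.items():
    print(f'rate(http_requests_total{{url="{url}"}}[5m]) = {value:.2f}')
```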

A few notes:

  • Both rate() and increase() properly handle e.g. counter resets, when the counter is reset to zero.

  • Sometimes Prometheus can return unexpected results from rate() and increase() because of the chosen data model. See this issue. This issue is addressed in VictoriaMetrics (a Prometheus-like monitoring system I work on); see this comment and this article.

  • Some PromQL-compatible query engines such as MetricsQL allow skipping the lookbehind window in square brackets when using the rate() function, so rate(http_requests_total) is a valid MetricsQL query. In this case the engine automatically adds a [$__interval] lookbehind window before query execution. See these docs for more details.
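
To illustrate the first note, here is a simplified sketch of how a counter reset can be compensated for when computing an increase. It treats any drop in the sample value as a reset to zero and counts the post-reset value in full. This mirrors the general idea only: increase_with_resets is a hypothetical helper, not the actual Prometheus or VictoriaMetrics implementation, and the handling of window boundaries is omitted.

```python
def increase_with_resets(values):
    """Increase of a counter over a list of consecutive samples.

    Assumption: a sample lower than its predecessor means the counter
    was reset to zero in between, so the new value counts in full.
    """
    total = 0
    for previous, current in zip(values, values[1:]):
        if current >= previous:
            total += current - previous  # normal monotonic growth
        else:
            total += current  # reset: counter restarted from zero
    return total


# Hypothetical samples: the counter restarts after reaching 120.
with_reset = [100, 110, 120, 5, 15]
naive = with_reset[-1] - with_reset[0]  # misleading negative value
corrected = increase_with_resets(with_reset)
print(naive, corrected)
```

Without the correction, the plain difference between the last and first sample would be negative, which makes no sense for a counter.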

Upvotes: 20

Amir Mehler

Reputation: 4680

First of all - use the tool that matches your use case.

Second - whatever you choose, validate the data. And better do it now than during an outage or with an angry customer/user.

Third - _sum and _count are exposed by both histograms and summaries (_bucket only by histograms). The scrape interval doesn't really matter here, as long as it's shorter than the [5m] window of the rate() function, so that the window contains at least two samples.

rate() simply gives you data points of "how many occurrences happened per second, on average, during these five minutes" ([5m]).

General note - the rate() concept in Prometheus is causing a lot of confusion. It's debated between too many people. They should have probably called it something else.

Upvotes: 1

sskrlj

Reputation: 309

While I am not familiar with Micrometer's Timer, the metric you're describing is of type Summary. It counts the "events" in _count and sums the events' magnitude (duration, elapsed time and similar) in _sum. If you evaluate rate(metric_count[5m]), you get the per-second rate of your events, averaged over 5m. And if you want the average duration of these events within the 5m window, you use rate(metric_sum[5m]) / rate(metric_count[5m]). If you divide metric_sum / metric_count instead, you get the all-time average (since the last counter reset) rather than the 5m average at some point in time. In a way, it looks a bit funny to use rate() for this. Using increase() seems more intuitive to me, but mathematically it's exactly the same: rate() is just increase() divided by the range, so the ranges cancel out in rate(metric_sum[5m]) / rate(metric_count[5m]).
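
The difference between the two averages can be seen in a small Python sketch with made-up counter values (the numbers and the helper names are assumptions for illustration only). It also shows that dividing both deltas by the window length changes nothing, which is why rate() and increase() give the same ratio.

```python
WINDOW_SECONDS = 300  # the [5m] window

# Hypothetical counter values at the start and end of the window.
sum_start, sum_end = 1000.0, 1030.0  # metric_sum: total seconds spent
count_start, count_end = 500, 510    # metric_count: number of events


def windowed_average(sum_start, sum_end, count_start, count_end):
    """Average event duration using increase(): delta sum / delta count."""
    return (sum_end - sum_start) / (count_end - count_start)


def windowed_average_via_rate(sum_start, sum_end, count_start, count_end, window):
    """Same average using rate(): both deltas are divided by the window."""
    sum_rate = (sum_end - sum_start) / window
    count_rate = (count_end - count_start) / window
    return sum_rate / count_rate


recent = windowed_average(sum_start, sum_end, count_start, count_end)
recent_via_rate = windowed_average_via_rate(
    sum_start, sum_end, count_start, count_end, WINDOW_SECONDS
)
all_time = sum_end / count_end  # what metric_sum / metric_count yields

print(f"last 5m: {recent:.2f}s, all time: {all_time:.2f}s")
```

In this made-up data the ten events in the window averaged 3 s each, while the all-time average is still close to 2 s, so a recent slowdown is barely visible in metric_sum / metric_count.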

Upvotes: 2
