Reputation: 301
I am doing some monitoring with Prometheus and am trying to understand how to use the rate functions properly.
The premise is this: I have a counter, and the configuration is set to ingest new values every 15s.
Now I am trying to graph the per second rate of this, so using the rate function I do this as:
rate(pgbouncer_sent_bytes_total{job="pgbouncer", database="worker"}[1m])
As I interpret the rate function, the query gives me a rolling average rate (over 1m look-back windows) at each point in time that is queried. The interval between points is determined by the resolution used.
Below is a screenshot from the prometheus console including the raw data graph and the plot from the rate query above using a 1m resolution. Now the resulting rate graph here does not really match my expectations looking at the raw data in the bottom graph.
The interesting bit is also that the resulting graph looks very different depending on the point in time at which it is loaded. Simply reloading the same graph a few times in a row completely changes its look, to the point where it does not even look like it represents the same data. The image below is the same dataset a few minutes later, but the same thing happens even seconds later.
Could someone shed some light on what is really going on here?
Upvotes: 20
Views: 35669
Reputation: 18094
The rate() function in Prometheus can miss some increases for slow-changing time series, as Alin explained in this answer. See also this issue. Prometheus developers are going to fix this in the near future according to Alin's design doc.
There is a workaround, though: use the rate() function from MetricsQL. It is free from the issues mentioned above, so it should return the expected results for both fast-changing and slow-changing counters. See technical details here and here.
Upvotes: 0
Reputation: 10134
AFAICT the cause of the weird results is (1) the fact that your counter actually only increases once every minute, even though you collect it every 15 seconds, combined with (2) Prometheus' rate() implementation discarding every 4th counter increase (in your particular setup).
More precisely, you appear to be computing a 1 minute rate, every 1 minute over a counter scraped at 15 second resolution, increasing every 1 minute (on average).
Essentially, this means that Prometheus will slice your 1 hour interval into disjoint 1 minute ranges and estimate the rate over each range. The first value will be the extrapolated rate of increase between points 0 and 3, the second will be the extrapolated rate between points 4 and 7, and so on. Because your counter only actually increases once a minute, you can run into 2 different situations:
rate() returns something closer to 90 QPS). This is what happens in the second half of your graph.
This is also why your graph looks wildly different across refreshes. The argument for the current implementation of rate() is that it is "correct on average". Which, if you look at the whole of your graph, across refreshes, is true. </sarcasm>
Essentially, graphing a Prometheus rate() or increase() over a time range R with resolution R will result in aliasing, either overestimating (1.33x in your case) or underestimating (zero in your case) on anything but a smoothly increasing counter.
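The aliasing described above can be reproduced with a small Python sketch. It is a simplified model, not Prometheus' actual implementation: it takes the slope between the first and last sample in each window instead of reproducing the real extrapolation logic, which should give the same figures in this particular setup. The bump size, the bump phases and the evaluation offset are hypothetical values chosen only for illustration.

```python
SCRAPE_INTERVAL = 15  # seconds between scrapes (from the question)
BUMP_INTERVAL = 60    # the counter only increases about once a minute
BUMP_SIZE = 3600      # hypothetical increase per bump, i.e. a true rate of 60/s
WINDOW = 60           # range selector [1m]
STEP = 60             # graph resolution of 1m


def counter_at(t, bump_phase):
    """Counter value at time t when bumps happen at bump_phase, bump_phase + 60, ..."""
    if t < bump_phase:
        return 0
    return BUMP_SIZE * ((t - bump_phase) // BUMP_INTERVAL + 1)


def window_rate(eval_time, bump_phase):
    """Simplified rate: slope between first and last sample inside the window."""
    samples = [
        t
        for t in range(0, eval_time + 1, SCRAPE_INTERVAL)
        if t > eval_time - WINDOW
    ]
    first, last = samples[0], samples[-1]  # 4 samples, so only 45s are covered
    delta = counter_at(last, bump_phase) - counter_at(first, bump_phase)
    return delta / (last - first)


def graph(bump_phase, eval_offset=50, points=5):
    """Rates at consecutive 1m steps, as a 1m-resolution graph would show them."""
    return [window_rate(eval_offset + k * STEP, bump_phase) for k in range(points)]


true_rate = BUMP_SIZE / BUMP_INTERVAL
print(true_rate)   # 60.0
print(graph(20))   # bump falls inside the 45s each window covers
print(graph(50))   # bump falls in the 15s gap between consecutive windows
```

In this model the first graph sits at 80/s (1.33x the true 60/s) and the second at zero, even though the underlying counter is identical. Which case you get depends only on where the evaluation timestamps land relative to the counter's bumps, which is consistent with the graph changing shape on reload.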
You can work around it by replacing your expression with:
rate(foo[75s]) / 75 * 60
This way you'll actually get the rate of increase between data points 1 minute apart (a 75 second range will almost always return exactly 5 points, so 4 counter increases) and reverse the extrapolation to 75 seconds that Prometheus does. There will be some noise in edge cases (e.g. if your evaluation is aligned with scraping times it's possible to get 6 points in one range and 4 in the next due to scrape interval jitter) but you're getting that anyway with rate().
BTW, you can see the aliasing by increasing the resolution of your graph to something like 1 second (anything 15 seconds or below should show it clearly).
Upvotes: 34
Reputation: 34172
What you say doesn't line up with the data: the raw data is only going up about once a minute. Are you sure you're scraping every 15s?
Upvotes: 2