Reputation: 3367
I have a question about calculating response times with Prometheus summary metrics.
I created a summary metric that contains not only the service name but also the complete path and the HTTP method.
Now I am trying to calculate the average response time for the complete service. I read the article about "rate then sum", and either I do not understand how the calculation is done or the calculation is, IMHO, not correct.
As far as I have read, this should be the correct way to calculate the response time per second:
sum by(service_id) (
rate(request_duration_sum{status_code=~"2.*"}[5m])
/
rate(request_duration_count{status_code=~"2.*"}[5m])
)
As I understand it, this computes the "duration per second" value (rate of sum / rate of count) for each subset and then sums those values per service_id.
This looks absolutely wrong to me, but perhaps it does not work the way I understand it.
Another way to get a similar-looking result is this:
sum without (path,host) (
rate(request_duration_sum{status_code=~"2.*"}[5m])
/
rate(request_duration_count{status_code=~"2.*"}[5m])
)
If I ignored everything I have read, I would try it the following way:
rate(sum by(service_id) request_duration_sum{status_code=~"2.*"}[5m])
/
rate(sum by(service_id) request_duration_count{status_code=~"2.*"}[5m])
But this will not work at all... (instant vector vs range vector and so on...).
Upvotes: 16
Views: 33381
Reputation: 18010
The by modifier groups aggregate function results by the labels enumerated inside by(...).
The without modifier groups aggregate function results by all the labels except those enumerated inside without(...).
For example, suppose the process_resident_memory_bytes metric exists with job, instance and datacenter labels:
process_resident_memory_bytes{job="job1",instance="host1",datacenter="dc1"} N1
process_resident_memory_bytes{job="job1",instance="host2",datacenter="dc1"} N2
process_resident_memory_bytes{job="job1",instance="host1",datacenter="dc2"} N3
process_resident_memory_bytes{job="job2",instance="host1",datacenter="dc1"} N4
Then sum(process_resident_memory_bytes) by (datacenter) would return the summed memory usage per datacenter, while sum(process_resident_memory_bytes) without (instance) would return the summed memory usage per job and per datacenter.
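To make the grouping concrete, here is a minimal Python sketch that mimics how by and without choose the group key. It is not PromQL and not how Prometheus is implemented; the values 10, 20, 30 and 40 are made-up stand-ins for N1..N4, and the helper names sum_by and sum_without are hypothetical.

```python
from collections import defaultdict

# Made-up sample values standing in for N1..N4 in the example above.
samples = [
    ({"job": "job1", "instance": "host1", "datacenter": "dc1"}, 10.0),
    ({"job": "job1", "instance": "host2", "datacenter": "dc1"}, 20.0),
    ({"job": "job1", "instance": "host1", "datacenter": "dc2"}, 30.0),
    ({"job": "job2", "instance": "host1", "datacenter": "dc1"}, 40.0),
]


def _aggregate(samples, keep):
    """Sum values per group; `keep` decides which label names form the key."""
    groups = defaultdict(float)
    for labels, value in samples:
        key = tuple(sorted((k, v) for k, v in labels.items() if keep(k)))
        groups[key] += value
    return dict(groups)


def sum_by(samples, names):
    # Like `sum by (...)`: the group key is ONLY the listed labels.
    return _aggregate(samples, lambda k: k in names)


def sum_without(samples, names):
    # Like `sum without (...)`: the group key is every label EXCEPT the listed ones.
    return _aggregate(samples, lambda k: k not in names)


per_dc = sum_by(samples, {"datacenter"})
per_job_dc = sum_without(samples, {"instance"})

for key, total in sorted(per_dc.items()):
    print("by (datacenter):", dict(key), total)
for key, total in sorted(per_job_dc.items()):
    print("without (instance):", dict(key), total)
```

With these made-up numbers, the by (datacenter) result has two groups (dc1 and dc2), while the without (instance) result has three, one per remaining job/datacenter combination.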
Upvotes: 13
Reputation: 567
Using Prometheus metrics in Grafana, the without keyword did not work for me (at least as I expected it to). I got satisfying results with by:
sum by (status_code)(
rate(request_duration_sum{status_code=~"2.*"}[5m])
)
/
sum by (status_code)(
rate(request_duration_count{status_code=~"2.*"}[5m])
)
Upvotes: 0
Reputation: 34142
All of these examples are aggregating incorrectly, as you're averaging an average. You want:
sum without (path,host) (
rate(request_duration_sum{status_code=~"2.*"}[5m])
)
/
sum without (path,host) (
rate(request_duration_count{status_code=~"2.*"}[5m])
)
Which will return the average latency per status_code plus any other remaining labels.
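A small Python sketch of the arithmetic may help show why the order matters. The two paths and their rates are invented for illustration: each tuple stands for what rate(request_duration_sum[5m]) and rate(request_duration_count[5m]) might return for one path of the same service.

```python
# Invented per-path rates for one service:
# (path, latency seconds accumulated per second, requests per second)
series = [
    ("/fast", 1.0, 100.0),  # 100 req/s at about 0.01 s each
    ("/slow", 2.0, 1.0),    # 1 req/s at 2 s each
]

# Dividing first and aggregating afterwards: every path counts equally,
# no matter how many requests it served.
per_path_avg = [duration / count for _, duration, count in series]
sum_of_averages = sum(per_path_avg)
mean_of_averages = sum_of_averages / len(per_path_avg)

# Aggregating first and dividing afterwards: total latency over total
# requests, so each request counts equally.
total_duration = sum(duration for _, duration, _ in series)
total_count = sum(count for _, _, count in series)
ratio_of_sums = total_duration / total_count

print("sum of per-path averages: ", sum_of_averages)
print("mean of per-path averages:", mean_of_averages)
print("sum(rate sum) / sum(rate count):", ratio_of_sums)
```

In this invented case the service handles 101 requests per second and spends about 3 seconds of latency on them per second, so the true average is roughly 0.03 s. Summing the per-path averages gives about 2 s, and even taking their mean gives about 1 s, because the single slow request per second is weighted the same as the hundred fast ones.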
Upvotes: 14