Reputation: 41
I'm trying to figure out the PromQL for an SLO for latency, where we want 90% of all requests to be served in 1000ms or less.
I can get the 90th percentile of requests with this:
histogram_quantile( 0.90, sum by (le) ( rate(MyMetric_Request_Duration_bucket{instance="foo"}[1h]) ) )
And I can find what percentage of ALL requests were served in 1000ms or less with this:
(sum(rate(MyMetric_Request_Duration_bucket{le="1000",instance="foo"}[1h])) / sum(rate(MyMetric_Request_Duration_count{instance="foo"}[1h]))) * 100
Is it possible to combine these into one query that tells me what percentage of requests in the 90th percentile were served in 1000ms or less?
I tried the most obvious (to me anyway) solution, but got no data back.
histogram_quantile( 0.90, sum by (le) ( rate(MyMetric_Request_Duration_bucket{le="1000",instance="foo"}[1h]) ) )
The goal is to get a measure that shows: for the 90th percentile of requests, how many of those requests were under 1000ms? It seems like this should be simple, but I can't find a PromQL query that does it.
Upvotes: 2
Views: 1048
Reputation: 51
I'm aware that I'm replying to an old thread, but maybe it will still be helpful to someone.
I understand that you want to build an SLO which could be phrased as:
The 90th percentile of request latency is lower than 1000ms 99.9% of the time over a week.
I added the week part as I couldn't find the period over which your SLO is evaluated.
I believe it can be solved by the following PromQL query:
avg_over_time(
  (
    histogram_quantile(
      0.9,
      sum by (le) (rate(MyMetric_Request_Duration_bucket{instance="foo"}[1h]))
    ) < bool 1000
  )[7d:]
) * 100
The trick is that it calculates the 90th percentile and then performs a binary quantization of the result. In other words, it creates a vector of 1s and 0s corresponding to whether the target value is met. Then the avg_over_time function produces the ratio of good samples to all samples.
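The quantize-then-average step can be sketched in plain Python (not PromQL; the sample values are made up) to show why the mean of a 0/1 series equals the fraction of time the target was met:

```python
# Hypothetical per-interval p90 latency samples, in milliseconds.
p90_samples_ms = [850, 920, 1100, 780, 1300, 640, 990, 870]

# The "< bool 1000" step: 1 if the target was met in that interval, else 0.
good = [1 if p90 < 1000 else 0 for p90 in p90_samples_ms]

# The avg_over_time step: the mean of the 0/1 series is the compliance ratio.
compliance = sum(good) / len(good) * 100

print(f"{compliance:.1f}% of intervals met the 1000ms p90 target")
# 6 of 8 samples are under 1000ms -> 75.0%
```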
I faced a very similar problem. Based on my solution I wrote this article, where I've included a few remarks on why it's generally better to avoid percentiles in SLOs. You might find it interesting: https://mkaz.me/blog/2024/slo-formulas-implementation-in-promql-step-by-step#formula-for-latency-slo-with-percentiles
Upvotes: 2
Reputation: 18010
Prometheus doesn't provide a function for calculating the share (aka the percentage) of requests served in under one second from histogram buckets. But such a function exists in VictoriaMetrics - this is a Prometheus-like monitoring system I work on. The function is histogram_share(). For example, the following query returns the share of requests with durations smaller than one second served during the last hour:
histogram_share(1s, sum(rate(http_request_duration_seconds_bucket[1h])) by (le))
Then the following query can be used for alerting when the share of requests served in less than one second drops below 90%:
histogram_share(1s, sum(rate(http_request_duration_seconds_bucket[1h])) by (le)) < 0.9
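As a rough illustration of what a histogram_share()-style calculation does (this is a simplified Python sketch, not VictoriaMetrics' actual implementation; the bucket data is made up), the share below a threshold is read from cumulative "le" bucket counts, with linear interpolation inside the bucket that straddles the threshold:

```python
# Cumulative (le_seconds, count) pairs, as Prometheus histograms store them.
# Made-up data: 400 requests total, 340 completed within 1.0s.
buckets = [(0.5, 120), (1.0, 340), (2.5, 390), (float("inf"), 400)]

def histogram_share(threshold, buckets):
    """Estimate the fraction of observations <= threshold."""
    prev_le, prev_count = 0.0, 0
    total = buckets[-1][1]
    for le, count in buckets:
        if threshold <= le:
            if le == float("inf"):
                # No upper bound to interpolate against.
                return prev_count / total
            # Linear interpolation within the straddling bucket.
            frac = (threshold - prev_le) / (le - prev_le)
            return (prev_count + frac * (count - prev_count)) / total
        prev_le, prev_count = le, count
    return 1.0

print(histogram_share(1.0, buckets))  # 340/400 -> 0.85
```

The interpolation step is also why the accuracy note below matters: inside a bucket, the true distribution is unknown, so the result is only an estimate.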
Please note that all functions which work over histogram buckets return estimated results. Their accuracy depends heavily on the chosen histogram bucket boundaries. See this article for details.
Upvotes: 3
Reputation: 10110
Welcome to SO.
Out of all the requests, how many are served under 1000ms? To find that, I would divide the number of requests under 1000ms by the total number of requests. In my GCP world, it translates to a query like this:
(sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination",namespace="abcxyz",le="1000"}[1m]))/sum(rate(istio_request_duration_milliseconds_count{reporter="destination",namespace="abcxyz"}[1m])))*100
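The arithmetic behind that query can be shown with a trivial Python sketch (the rate values are made up, not real Istio data):

```python
# Numerator: per-second rate of requests that completed within 1000ms
# (the ..._bucket{le="1000"} series). Denominator: rate of all requests
# (the ..._count series). Both values are hypothetical.
under_1000ms_per_sec = 46.5
total_per_sec = 50.0

share = under_1000ms_per_sec / total_per_sec * 100

print(f"{share:.1f}% served under 1000ms")  # -> 93.0%
```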
Once you have a graph set up with the above query in Grafana, you can set up an alert on anything below 93; that way you are alerted even before you breach your SLO of 90%.
Upvotes: 0