Reputation: 41
I'm trying to figure out the PromQL for an SLO for latency, where we want 90% of all requests to be served in 1000ms or less.
I can get the 90th percentile of requests with this:
histogram_quantile( 0.90, sum by (le) ( rate(MyMetric_Request_Duration_bucket{instance="foo"}[1h]) ) )
And I can find what percentage of ALL requests were served in 1000ms or less with this:
(sum(rate(MyMetric_Request_Duration_bucket{le="1000",instance="foo"}[1h])) / sum(rate(MyMetric_Request_Duration_count{instance="foo"}[1h]))) * 100
Is it possible to combine these into one query that tells me what percentage of requests in the 90th percentile were served in 1000ms or less?
I tried the most obvious (to me anyway) solution, but got no data back.
histogram_quantile( 0.90, sum by (le) ( rate(MyMetric_Request_Duration_bucket{le="1000",instance="foo"}[1h]) ) )
The goal is to get a measure that shows: for the 90th percentile of requests, how many of those requests were under 1000ms? It seems like this should be simple, but I can't find a PromQL query that does it.
Upvotes: 2
Views: 1048
Reputation: 51
I'm aware that I'm replying to an old thread, but maybe it will still be helpful to someone.
I understand that you want to build an SLO which could be phrased as:
The 90th percentile of request latency is lower than 1000ms 99.9% of the time over a week.
I added the week part as I couldn't find the period over which your SLO is evaluated.
I believe it can be solved by the following PromQL query:
avg_over_time(
  (
    histogram_quantile(
      0.9,
      sum by (le) (rate(MyMetric_Request_Duration_bucket{instance="foo"}[1h]))
    ) < bool 1000
  )[7d:]
) * 100
The trick is that it calculates the 90th percentile and then performs a binary quantization of the result. In other words, it creates a vector of 1s and 0s corresponding to whether the target value is met. Then the avg_over_time function produces the ratio of good samples to all samples.
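The quantize-then-average step can be sketched in plain Python (not PromQL; the sample values are made up) to show why the mean of a 0/1 series equals the fraction of time the target was met:

```python
# Hypothetical per-interval p90 latency samples, in milliseconds.
p90_samples_ms = [850, 920, 1100, 780, 1300, 640, 990, 870]

# The "< bool 1000" step: 1 if the target was met in that interval, else 0.
good = [1 if p90 < 1000 else 0 for p90 in p90_samples_ms]

# The avg_over_time step: the mean of the 0/1 series is the compliance ratio.
compliance = sum(good) / len(good) * 100

print(f"{compliance:.1f}% of intervals met the 1000ms p90 target")
# 6 of 8 samples are under 1000ms -> 75.0%
```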
I faced a very similar problem. Based on my solution I wrote this article, where I've included a few remarks on why it's generally better to avoid percentiles in SLOs. You might find it interesting: https://mkaz.me/blog/2024/slo-formulas-implementation-in-promql-step-by-step#formula-for-latency-slo-with-percentiles
Upvotes: 2
Reputation: 18010
Prometheus doesn't provide a function for calculating the share (aka the percentage) of requests served in under one second from histogram buckets. But such a function exists in VictoriaMetrics - this is a Prometheus-like monitoring system I work on. The function is histogram_share(). For example, the following query returns the share of requests with durations smaller than one second served during the last hour:
histogram_share(1s, sum(rate(http_request_duration_seconds_bucket[1h])) by (le))
Then the following query can be used for alerting when the share of requests served in less than one second drops below 90%:
histogram_share(1s, sum(rate(http_request_duration_seconds_bucket[1h])) by (le)) < 0.9
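As a rough illustration of what a histogram_share()-style calculation does (this is a simplified Python sketch, not VictoriaMetrics' actual implementation; the bucket data is made up), the share below a threshold is read from cumulative "le" bucket counts, with linear interpolation inside the bucket that straddles the threshold:

```python
# Cumulative (le_seconds, count) pairs, as Prometheus histograms store them.
# Made-up data: 400 requests total, 340 completed within 1.0s.
buckets = [(0.5, 120), (1.0, 340), (2.5, 390), (float("inf"), 400)]

def histogram_share(threshold, buckets):
    """Estimate the fraction of observations <= threshold."""
    prev_le, prev_count = 0.0, 0
    total = buckets[-1][1]
    for le, count in buckets:
        if threshold <= le:
            if le == float("inf"):
                # No upper bound to interpolate against.
                return prev_count / total
            # Linear interpolation within the straddling bucket.
            frac = (threshold - prev_le) / (le - prev_le)
            return (prev_count + frac * (count - prev_count)) / total
        prev_le, prev_count = le, count
    return 1.0

print(histogram_share(1.0, buckets))  # 340/400 -> 0.85
```

The interpolation step is also why the accuracy note below matters: inside a bucket, the true distribution is unknown, so the result is only an estimate.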
Please note that all functions which work over histogram buckets return estimated results. Their accuracy depends heavily on the chosen histogram bucket boundaries. See this article for details.
Upvotes: 3
Reputation: 10110
Welcome to SO.
Out of all the requests, how many are served under 1000ms? To find that, I would divide the number of requests under 1000ms by the total number of requests. In my GCP world, it translates to a query like this:
(sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination",namespace="abcxyz",le="1000"}[1m]))/sum(rate(istio_request_duration_milliseconds_count{reporter="destination",namespace="abcxyz"}[1m])))*100
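The arithmetic behind that query can be shown with a trivial Python sketch (the rate values are made up, not real Istio data):

```python
# Numerator: per-second rate of requests that completed within 1000ms
# (the ..._bucket{le="1000"} series). Denominator: rate of all requests
# (the ..._count series). Both values are hypothetical.
under_1000ms_per_sec = 46.5
total_per_sec = 50.0

share = under_1000ms_per_sec / total_per_sec * 100

print(f"{share:.1f}% served under 1000ms")  # -> 93.0%
```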
Once you have a graph set up with the above query in Grafana, you can set up an alert on anything below 93; that way you are alerted even before you breach your SLO of 90%.
Upvotes: 0