Shashank Agrawal

Reputation: 345

GCP alerting policies based on percentage

I am trying to create some alerting policies in GCP for my application, which is hosted in a Kubernetes cluster. We have a Cloud load balancer serving the traffic, and I can see the HTTP status codes such as 2XX and 5XX.

I need to create alerting policies based on the error percentage, i.e. (NumberOfFailures / Total) * 100, rather than on an absolute count, so that an alert is triggered if the error percentage goes above, say, 50%.

I couldn't find anything in the Google documentation. It just tells you to use a counter, which amounts to using an absolute value. I am looking for something like this: if the failure rate goes beyond 50% in a rolling window of 15 minutes, trigger the alert.

Is it even possible to do that natively in GCP?

Upvotes: 6

Views: 1776

Answers (1)

p13rr0m

Reputation: 1297

Yes, I think this is possible with MQL (Monitoring Query Language). I recently created something similar to your use case.

fetch api
    | metric 'serviceruntime.googleapis.com/api/request_count'
    | filter (resource.service == 'my-service.com')
    | group_by 10m,
        [value_request_count_aggregate: aggregate(value.request_count)]
    | every 10m
    | { group_by [metric.response_code_class],
            [response_code_count_aggregate:
                aggregate(value_request_count_aggregate)]
        | filter (metric.response_code_class == '5xx')
      ; group_by [],
            [value_request_count_aggregate_aggregate:
                aggregate(value_request_count_aggregate)] }
    | join
    | value [response_code_ratio: val(0) / val(1)]
    | condition gt(val(), 0.1)

In this example, I use the request count for the service my-service.com. The query first aggregates the request count over 10-minute windows. Inside the braces, the first branch groups the counts by response code class and keeps only the 5xx responses, while the second branch sums the requests for all response codes over the same time period. After the join, the last two lines compute the ratio of 5xx responses to all responses and turn it into a boolean that is true when the ratio is above 0.1, which I can use to trigger an alert.
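To get closer to the numbers in the question (50% over 15 minutes), it should be enough to change the window and the threshold. The following is an untested sketch under that assumption: it keeps the same metric and the placeholder service name my-service.com from the query above, and only changes 10m to 15m and 0.1 to 0.5. For load balancer traffic you would also have to swap in the request count metric and the response code label values that your load balancer reports, which I have not checked here.

```
fetch api
    | metric 'serviceruntime.googleapis.com/api/request_count'
    | filter (resource.service == 'my-service.com')
    | group_by 15m,
        [value_request_count_aggregate: aggregate(value.request_count)]
    | every 15m
    | { group_by [metric.response_code_class],
            [response_code_count_aggregate:
                aggregate(value_request_count_aggregate)]
        | filter (metric.response_code_class == '5xx')
      ; group_by [],
            [value_request_count_aggregate_aggregate:
                aggregate(value_request_count_aggregate)] }
    | join
    | value [response_code_ratio: val(0) / val(1)]
    | condition gt(val(), 0.5)
```

The ratio is a fraction between 0 and 1, so 0.5 corresponds to the 50% in the question. For example, 120 responses in the 5xx class out of 200 requests in a window gives 0.6, which is above the threshold and would satisfy the condition.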

I hope this gives you a rough idea of how you can create your own alerting policy based on percentages.

Upvotes: 5
