PatPanda
PatPanda

Reputation: 5010

Java Micrometer - What to do with metrics of type *_bucket

Quick question regarding metrics of type *_bucket please.

My application generates metrics, like those below:


# HELP http_server_requests_seconds  
# TYPE http_server_requests_seconds histogram
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/health",le="0.005592405",} 273.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/health",le="0.006990506",} 797.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/health",le="0.008388607",} 2638.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/health",le="0.009786708",} 3543.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/health",le="0.011184809",} 3932.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/health",le="0.01258291",} 4154.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/health",le="0.013981011",} 4279.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/health",le="0.015379112",} 4380.0

and

# HELP resilience4j_circuitbreaker_calls_seconds Total number of successful calls
# TYPE resilience4j_circuitbreaker_calls_seconds histogram
resilience4j_circuitbreaker_calls_seconds_bucket{kind="successful",name="someName",le="0.001",} 0.0
resilience4j_circuitbreaker_calls_seconds_bucket{kind="successful",name="someName",le="0.001048576",} 0.0
resilience4j_circuitbreaker_calls_seconds_bucket{kind="successful",name="someName",le="0.001398101",} 0.0
resilience4j_circuitbreaker_calls_seconds_bucket{kind="successful",name="someName",le="0.001747626",} 0.0
resilience4j_circuitbreaker_calls_seconds_bucket{kind="successful",name="someName",le="0.002097151",} 0.0
resilience4j_circuitbreaker_calls_seconds_bucket{kind="successful",name="someName",le="0.002446676",} 0.0
resilience4j_circuitbreaker_calls_seconds_bucket{kind="successful",name="someName",le="0.002796201",} 0.0

I believe they are really useful, but unfortunately, I do not know what to do with them.

I tried some queries such as rate(http_server_requests_seconds{_bucket_=\"+Inf\", status=~\"2..\"}[5m]), but does not seems to bring anything valuable out.

May I ask what is the proper way to use those metrics of type *_bucket, for instance, how to build Grafana dashboards and visuals that are the best suited for those *_bucket please?

Thank you

Upvotes: 3

Views: 8317

Answers (2)

thoredge
thoredge

Reputation: 12591

They are very useful to present as heatmaps. In Grafana select visualization Heatmap and in the query select Format also to be Heatmap. This simplest query should present something useful:

increase(resilience4j_circuitbreaker_calls_seconds_bucket[5m])

... and use {{le}} as Legend.

You may need to filter, sum or average other labels you may have, but be sure to calculate this ... by (le).

Upvotes: 1

Sumit Chawla
Sumit Chawla

Reputation: 116

you can find 99th percentile/95th percentile of the latency of given endpoint using this metric and can use histogram_quantile function for that. e.g. For 99th percentile :

histogram_quantile(
  0.99, 
  sum(
    rate(
      http_server_requests_seconds_bucket{exception="None", uri = "/your-uri"}[5m])
  ) by (le)
)

For 95th percentile :

histogram_quantile(
  0.95, 
  sum(
    rate(http_server_requests_seconds_bucket{exception="None", uri = "/your-uri"}[5m])
  ) by (le)
)

More on it: A nice snippet from reference: https://idanlupinsky.com/blog/application-monitoring-with-micrometer-prometheus-grafana-and-cloudwatch/

The histogram is a collection of buckets (or counters), each maintaining the number of events observed that took up to duration specified by the le tag. Let's have a look at a part of the histogram as published by our demo application:

http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/demo",le="0.067108864",} 0.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/demo",le="0.089478485",} 0.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/demo",le="0.111848106",} 92382.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/demo",le="0.134217727",} 99050.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/demo",le="0.156587348",} 99703.0
...
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/demo",le="0.984263336",} 99987.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/demo",le="1.0",} 99987.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/demo",le="+Inf",} 100000.0

The second line in the listing above indicates there were no requests observed that took up to ~89ms (specified by the le tag). This is expected given the 100ms sleep time when processing requests. Line #3 shows that 92,382 requests were observed whose duration took up to ~111ms. Note that the histogram is cumulative and that the entire count of requests falls in the last bucket with no upper limit le="+Inf".

Upvotes: 8

Related Questions