PatPanda

Reputation: 5010

CPU metrics in Grafana for Spring Webflux app with Actuator Micrometer and Prometheus

A small question on how to build visuals and gain insight from CPU metrics, please.

I have a Spring Boot Webflux app, nothing extraordinary. I bring in the Actuator, Micrometer and Prometheus dependencies.

The app has out-of-the-box metrics for CPU, which I think is very cool. I also believe those metrics contain a tremendous amount of information. Unfortunately, I do not understand Grafana or the metrics themselves well enough to unleash their full potential.

The metrics are :

system_cpu_usage
process_cpu_usage
system_cpu_count
system_load_average_1m

Not knowing how to use them properly, I use these very basic noob queries:

system_cpu_usage{_ns_="my_namespace",cluster=~"my_cluster"}
process_cpu_usage{_ns_="my_namespace",cluster=~"my_cluster"}
system_cpu_count{_ns_="my_namespace",cluster=~"my_cluster"}
system_load_average_1m{_ns_="my_namespace",cluster=~"my_cluster"}

And with those, I do get some results back. The thing is, I just get flat lines from which no further insight can be drawn and no action taken.

I see on the web some more complex queries, such as

avg_over_time(process_cpu_usage{_ns_="my_namespace",cluster=~"my_cluster"}[1h])

Or some using delta, rate, or irate. But I am not sure what they are for.

What is the proper way to use those metrics, and what is wrong with my current queries? There is a gap between what I have now and meaningful metrics.

Thank you.

Upvotes: 2

Views: 2259

Answers (1)

Felipe

Reputation: 7563

Using avg_over_time over the last 1h is useful when you want to write an alerting rule for the Alertmanager. Imagine a setup where every spike on the CPU triggers the alert: that is undesirable. By the way, in this specific use case I would prefer a quantile over an average, because an average can hide high values (just because it is an average). Note that process_cpu_usage is a gauge, not a histogram, so histogram_quantile (which needs _bucket series with an le label) and rate (which is meant for counters) do not apply to it; quantile_over_time is the function that computes a percentile of a gauge over a time window. Some best practices with percentiles are here: https://prometheus.io/docs/practices/histograms/#quantiles . The range selector then determines the time window for your quantile.

quantile_over_time(0.9,
  process_cpu_usage{_ns_="my_namespace",cluster=~"my_cluster"}[1h]
)
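As a rough sketch of how the four metrics from the question could be turned into more readable Grafana panels: the queries below assume that process_cpu_usage is a gauge in the 0 to 1 range (as Micrometer normally exposes it) and that system_load_average_1m and system_cpu_count carry identical label sets so they can be divided one-to-one. The 5m and 1h windows are arbitrary illustrative choices, not recommendations.

```promql
# JVM process CPU as a percentage (assumes the gauge is in the 0..1 range).
process_cpu_usage{_ns_="my_namespace",cluster=~"my_cluster"} * 100

# Smoothed trend: average of the gauge over a sliding 5-minute window.
avg_over_time(process_cpu_usage{_ns_="my_namespace",cluster=~"my_cluster"}[5m])

# Worst case in the same window, which an average would flatten out.
max_over_time(process_cpu_usage{_ns_="my_namespace",cluster=~"my_cluster"}[5m])

# Load per core: values persistently above 1 suggest the host is saturated.
# Assumes both series have the same labels, so the division matches one-to-one.
system_load_average_1m{_ns_="my_namespace",cluster=~"my_cluster"}
  / system_cpu_count{_ns_="my_namespace",cluster=~"my_cluster"}

# How much the gauge changed between the start and end of the last hour.
delta(process_cpu_usage{_ns_="my_namespace",cluster=~"my_cluster"}[1h])
```

Of the functions mentioned in the question, delta is the one intended for gauges such as these; rate and irate are intended for counters, which none of these four metrics are.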

Upvotes: 1
