Reputation: 208
Goal
Track RPM and uptime via Grafana and Prometheus
Situation
We are using:
django-prometheus -> emits the metrics
fluent-bit -> scrapes the Django metrics every 15s and pushes them to Prometheus
prometheus -> 2 shards running via the Prometheus Operator on Kubernetes
Problem
When we compare the Grafana dashboard with the AWS target group request metrics, they don't match. I have tried all of the options below:
Expr: sum by(service) (irate(django_http_requests_before_middlewares_total{namespace="name"}[5m]))
Expr: sum by(service) (increase(django_http_requests_before_middlewares_total{namespace="name"}[5m]))
Expr: sum by(service) (rate(django_http_requests_before_middlewares_total{namespace="name"}[5m]))
django_http_requests_before_middlewares_total -> This is a Counter metric.
This counter never resets because each series carries a unique set of dimensions:
- container_id
- service_name
- namespace
Q. Is it possible to create a Grafana dashboard that matches the AWS target group metrics?
Ideally increase should work, but it takes the difference continuously, and that might be producing an incorrect result.
Thanks in advance.
Upvotes: 3
Views: 3816
Reputation: 18084
In theory the following query should return the exact number of per-service requests for the last minute:
sum(
  increase(django_http_requests_before_middlewares_total[1m])
) by (service)
But in practice Prometheus may return unexpected results for this query:
- Prometheus may miss the counter increase between the last raw sample just before the lookbehind window ([1m] in the query above) and the first raw sample inside the lookbehind window.
- increase(m[d]) would return empty results for d <= 1m whenever the lookbehind window contains fewer than two raw samples.
Prometheus developers are aware of these issues and are going to fix them - see this design doc.
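As a minimal sketch of the second pitfall (the metric name comes from the question; a ~1m scrape interval is assumed here for illustration):

# With a ~1m scrape interval, a 1m lookbehind window often contains fewer
# than two raw samples, so this query can return an empty result:
increase(django_http_requests_before_middlewares_total[1m])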
In the meantime you can try the increase() function in VictoriaMetrics - this is a Prometheus-like monitoring solution I work on. Its increase() function is free from the issues mentioned above.
An important note: both Prometheus and VictoriaMetrics calculate query results independently for each point displayed on the graph. So if you want to display the per-minute number of requests using the query above, you need to set the interval between points on the graph (aka the step) to one minute.
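For example (a sketch, not a full dashboard config): in Grafana this means setting the panel's "Min step" query option to 1m, so the lookbehind window matches the graph step and each point covers exactly one minute:

# The [1m] window should match the 1m graph step:
sum(increase(django_http_requests_before_middlewares_total[1m])) by (service)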
Upvotes: 4
Reputation: 20296
tl;dr - no, Prometheus does not keep enough data to give perfectly precise values.
To see why, let's assume that 1 minute ago Prometheus scraped a value of 10 for the metric http_requests, and just now it has been updated to 40.
It's already clear that with 1m sampling you don't know exactly when during the last minute those 30 requests happened. Was it a short spike, or were they distributed evenly? Regardless of that, rate(http_requests[1m]) will give you (40 - 10) / 60s = 0.5 requests per second. increase() works in the same fashion: it's rate() * interval, or 0.5 * 60 = 30.
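The same arithmetic written as queries (illustrative only, assuming exactly the two samples above, one minute apart):

rate(http_requests[1m])      # (40 - 10) / 60 = 0.5 requests per second
increase(http_requests[1m])  # 0.5 * 60 = 30 requests over the window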
Although the example above shows precise values, it should be clear that you won't be able to achieve perfect precision with this math. The error is generally insignificant unless you are dealing with slow-moving counters (ones that update once every several minutes).
Upvotes: 3