Reputation: 395
I've previously worked on a server-side project that uses Prometheus and Grafana to collect and display metrics. That worked out pretty well.
I am now working on a client-side application, that is, an app running on Android and iPhone devices. I've been asked to also use Prometheus and Grafana to collect and display the metrics coming from the app.
There are multiple challenges to overcome in this task. First, we need to figure out how the Prometheus server will scrape all the client-side apps. While the server-side project had a limited, known number of servers (say 10 servers at well-known IP addresses), the client-side app will have 1000s of instances running on many smartphones. I have already solved that first challenge, and it is not the point of my question here.
The most important issue I am facing is that a user can start and close the app on their smartphone at any time. This means there will potentially be 1000s of client-side apps running at the same time, and those apps will go online/offline very frequently.
Contrast that with a server-side service, where I had only 10 instances running for long stretches of time and Prometheus was scraping them every 30 seconds.
To put things into context, let's say I have a simple Prometheus counter that always increases in value:
requests_total{status="success"}
requests_total{status="failure"}
I could visualize the number of failed requests over time with this:
sum(increase(requests_total{status="failure"}[1m]))
That worked well for my server-side service, which ran for long stretches of time. Once in a blue moon, when I restart the service, there will be a discontinuity in the counter, but that happens infrequently. The Prometheus documentation for the increase function says:
... Breaks in monotonicity (such as counter resets due to target restarts) are automatically adjusted for.
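To make that adjustment concrete, here is a rough sketch (an illustration, not Prometheus source) of how resets are compensated: whenever a sample is lower than its predecessor, the counter is assumed to have restarted from 0, so the whole new value counts as increase. The real `increase()` additionally extrapolates to the window boundaries, which this sketch ignores.

```python
def simple_increase(samples):
    """Simplified model of Prometheus counter-reset handling.

    `samples` are the raw counter values inside a range window, in
    time order. A drop in value is treated as a restart to 0, so the
    post-reset value is added in full. (Real `increase()` also
    extrapolates to the window edges; this sketch does not.)
    """
    total = 0.0
    prev = samples[0]
    for v in samples[1:]:
        if v < prev:        # break in monotonicity -> assumed reset
            total += v      # everything since the reset counts
        else:
            total += v - prev
        prev = v
    return total

# One restart mid-window: 10 -> 12 -> reset -> 3 -> 5
print(simple_increase([10, 12, 3, 5]))  # 2 + 3 + 2 = 7.0
```

With one rare restart, the adjustment recovers roughly the right answer; the question below is what happens when these breaks occur constantly.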
But for my client-side app, there will be 1000s of app instances running and they will go online/offline frequently. That means there could be a lot of discontinuities in my counter.
I am turning to the Prometheus community for advice here. Does it make sense to use Prometheus to collect metrics from client-side applications that can go online/offline frequently? Or was Prometheus never designed to work with client-side applications?
Upvotes: 1
Views: 1218
Reputation: 395
Thanks @sskrlj. Here is a bit more information about my situation. For collecting the data from all the clients, I could have used the Prometheus Pushgateway as you've noted. Instead, I will be collecting all the data in Elasticsearch first: every 30 seconds, each client will push a small JSON doc containing its counter value into Elasticsearch.
Then I will be running an instance of an in-house service that acts as an adapter, similar to the Prometheus Elasticsearch exporter (https://github.com/braedon/prometheus-es-exporter), to aggregate all the values from Elasticsearch and expose a single Prometheus counter for the Prometheus server to scrape.
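Schematically, the adapter's scrape endpoint would do something like the following sketch. The document shape (`client_id`, `status`, `value`) and the metric name are my assumptions for illustration; the real in-house service queries Elasticsearch, but the aggregation and Prometheus text exposition format look the same.

```python
def render_metrics(docs):
    """Sum the latest per-client counter values (as fetched from
    Elasticsearch, mocked here as a list of dicts) and render them
    in the Prometheus text exposition format."""
    totals = {}
    for doc in docs:
        key = doc["status"]
        totals[key] = totals.get(key, 0) + doc["value"]
    lines = ["# TYPE requests_total counter"]
    for status in sorted(totals):
        lines.append('requests_total{status="%s"} %d' % (status, totals[status]))
    return "\n".join(lines) + "\n"

# Mocked latest docs from two clients:
docs = [
    {"client_id": "a", "status": "success", "value": 40},
    {"client_id": "a", "status": "failure", "value": 2},
    {"client_id": "b", "status": "success", "value": 10},
]
print(render_metrics(docs))
```

Note that this exposed "counter" is only as monotonic as the underlying per-client values it sums, which is exactly the problem described next.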
So there will be a single long-lived Prometheus counter to scrape all the time. But the problem is that this counter will not be monotonically increasing, because the users of the app can turn the app on/off at any time, so the counter will fluctuate up and down a lot. I don't know if sum(increase(counter...)) will give me anything meaningful. How does increase behave when there are many frequent breaks in monotonicity?
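As a self-answer to what goes wrong: every dip in the aggregated counter is indistinguishable from a counter reset, so the reset adjustment re-adds the entire post-dip value as if it were new increase. A small simulation (using the same reset handling Prometheus applies, extrapolation ignored; the numbers are invented) shows how badly this overstates the result:

```python
def adjusted_increase(samples):
    """Reset-adjusted increase over a window of raw samples,
    mirroring Prometheus's handling of monotonicity breaks
    (boundary extrapolation ignored)."""
    total, prev = 0.0, samples[0]
    for v in samples[1:]:
        # a drop is assumed to be a reset to 0, so v counts in full
        total += v if v < prev else v - prev
        prev = v
    return total

# Aggregated counter across clients: grows 90 -> 100, then a client
# holding 40 of those requests goes offline (drop to 60), then 60 -> 70.
# Real new requests in the window: 10 + 10 = 20.
print(adjusted_increase([90, 100, 60, 70]))  # 10 + 60 + 10 = 80.0
```

The drop to 60 is misread as a reset, so 60 phantom requests are added; with thousands of clients churning, the error compounds on every scrape.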
On a side note, the in-house adapter service I implemented can also aggregate values from Elasticsearch to produce a Prometheus histogram. A histogram is just a collection of counters, and there will also be a lot of breaks in monotonicity in the histogram, so I don't know how Prometheus and Grafana will display it.
Upvotes: 0
Reputation: 309
Short-lived counters must be one of the worst things you can do in Prometheus. Even if we leave aside capacity overuse (storage, RAM, CPU), you simply won't be able to get proper aggregates of these short-lived counters' rates or increases once you take into account the extrapolation Prometheus performs in rate/increase. I am experiencing exactly the same thing; while it's on the server side, I am tracking an entity which, in some situations, is created for a single event.
If you decide to go this way anyway, at least make sure to initialize counters to 0 when an instance of your "entity" is instantiated, and not only when the first event happens, as you may otherwise end up with counters stuck at 1 for their whole lifetime. sum(rate()) will still be 0 for those, even though you may have a million of them per second.
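The reason is that Prometheus can only see increases *between* samples. If the counter is already at 1 when it is first scraped and never moves again, there is no observed delta at all. A toy rate calculation (a simplification; real `rate()` also extrapolates and adjusts for resets) makes the point:

```python
def window_rate(samples, window_seconds):
    """Toy per-second rate: observed increase across the window's
    samples divided by the window length. Real Prometheus rate()
    also extrapolates to window edges and handles resets."""
    return (samples[-1] - samples[0]) / window_seconds

# Counter born at 1: the single event happened before the first scrape.
print(window_rate([1, 1, 1], 60))   # 0.0 -- the event is invisible
# Counter initialized to 0 first, then incremented:
print(window_rate([0, 1, 1], 60))   # ~0.0167 -- the event is counted
```

Initializing to 0 at instantiation is what makes the one increment observable as a delta.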
I am not sure I can advise on the best approach for client-side monitoring with many short-lived instances. But I am interested to learn from others what to do in such cases.
There are two issues to solve here.
Given the above, I think your solution will be built around a push approach (not necessarily the Prometheus Pushgateway, see https://prometheus.io/docs/practices/pushing/) and/or some intermediate aggregation, even though that's normally against Prometheus principles. I think.
Pushing your events to an intermediate counter ensures you don't lose events between scrapes, and since this intermediate service will be long-lived, so will its counters, which can then be happily scraped by Prometheus.
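One way to build such an intermediate counter (my own sketch, not something from this thread) is to have the aggregator remember each client's last reported value and fold only the *delta* into a long-lived total. A client that goes offline simply stops contributing deltas, so the exposed counter stays monotonic:

```python
class DeltaAggregator:
    """Long-lived intermediate counter fed by per-client pushes.

    Only the positive delta since each client's last report is added,
    so `total` never decreases, even when clients vanish or restart.
    """

    def __init__(self):
        self.last_seen = {}   # client_id -> last reported counter value
        self.total = 0        # monotonic counter exposed to Prometheus

    def report(self, client_id, value):
        prev = self.last_seen.get(client_id, 0)
        if value < prev:      # client restarted: its counter reset to 0
            prev = 0
        self.total += value - prev
        self.last_seen[client_id] = value

agg = DeltaAggregator()
agg.report("a", 5)    # +5
agg.report("b", 3)    # +3
agg.report("a", 9)    # +4
agg.report("a", 2)    # app restarted: counts as +2, total keeps rising
print(agg.total)      # 14
```

The trade-off is that the aggregator must hold per-client state, but in exchange the single scraped series behaves like an ordinary server-side counter.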
So, let's hear it from others.
Upvotes: 0