Reputation: 21
We have a number of Prometheus servers, each monitoring its own region (two per region, actually). There are also Thanos servers that can query across multiple regions, and we use Alertmanager for alerting.
Recently, a few metrics stopped being reported, and we only discovered it when we actually needed them. We are trying to figure out how to monitor changes in the number of reported metrics in a scalable system that grows and shrinks as required.
I'd be glad for any advice.
Upvotes: 1
Views: 2969
Reputation: 10064
You can either count the number of time series in the head block (the last 0-2 hours of data) or measure the rate at which you're ingesting samples:
prometheus_tsdb_head_series
or
rate(prometheus_tsdb_head_samples_appended_total[5m])
Then you compare said value with itself a few minutes/hours ago, e.g.
prometheus_tsdb_head_series / prometheus_tsdb_head_series offset 5m
and see whether it fits within an expected range (say 90-110%) and alert otherwise.
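For example, as a Prometheus alerting rule (a minimal sketch; the group name, alert name, thresholds and for duration are all illustrative, not anything your setup prescribes):
groups:
  - name: ingestion-sanity             # hypothetical group name
    rules:
      - alert: HeadSeriesCountAnomaly  # hypothetical alert name
        # Current head series count vs. 5 minutes ago; 0.9-1.1 is
        # the illustrative 90-110% band from above.
        expr: |
          prometheus_tsdb_head_series / prometheus_tsdb_head_series offset 5m < 0.9
            or
          prometheus_tsdb_head_series / prometheus_tsdb_head_series offset 5m > 1.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: Prometheus head series count changed by more than 10% over 5 minutes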
Or you can look at the metrics with the highest cardinality only:
topk(100, count({__name__=~".+"}) by (__name__))
Note, however, that this last expression can be quite costly to compute, so you may want to avoid it. And the comparison with the counts from 5 minutes ago is no longer as straightforward:
label_replace(topk(100, count({__name__=~".+"}) by (__name__)), "metric", "$1", "__name__", "(.*)")
/
label_replace(count({__name__=~".+"} offset 5m) by (__name__), "metric", "$1", "__name__", "(.*)")
You need the label_replace there because the matching for the division is done on labels other than __name__. Computing this last expression takes ~10s on my Prometheus instance with 150k series, so it's anything but fast.
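If you do want to alert on it anyway, the same ratio can be wrapped in a threshold directly; a sketch flagging any of the top-100 metrics whose series count fell below 90% of its value 5 minutes ago (the 0.9 cutoff is illustrative):
(
  label_replace(topk(100, count({__name__=~".+"}) by (__name__)), "metric", "$1", "__name__", "(.*)")
/
  label_replace(count({__name__=~".+"} offset 5m) by (__name__), "metric", "$1", "__name__", "(.*)")
) < 0.9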
And finally, whichever approach you choose, you're likely to get a lot of false positives (whenever a large job is started or taken down), to the point that it's not going to be all that useful. I would personally not bother trying.
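That said, if there is a short list of metrics you know you depend on, a per-metric absent() alert is the usual low-noise alternative (the metric name here is hypothetical):
absent(my_critical_metric)
This only returns a value, and can therefore fire an alert, when no series with that name exists at all.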
Upvotes: 4