Reputation: 2281
I have a cluster of AWS servers for which I track statistics using Graphite. The servers in the cluster change as new versions of software are deployed or as the cluster size grows or shrinks.
For example:
Metrics added yesterday:
servers.1.cpu
servers.2.cpu
Metrics added today:
servers.2.cpu
servers.3.cpu
When I view my data through Graphite, I only want it to show me metrics for the servers that have data for the time period that I am querying. However, because I don't know which servers were available in that time period, I specify * in the query, and this results in every server that has ever existed in the cluster being included in the series.
Query for the last 15 minutes:
servers.*.cpu
Results in:
servers.1.cpu
servers.2.cpu
servers.3.cpu
Is there a way to filter out of the series the servers that don't have data? In the above example, the result would then not include metrics from servers.1.cpu. However, if my query time period were yesterday, I would get servers.1.cpu, but not servers.3.cpu.
Upvotes: 2
Views: 1108
Reputation: 1108
We have a similar problem with metrics sent via statsd to Graphite. In some cases I've been able to use currentAbove(0) to filter out series without "interesting" values; this was successful for values from the collectd load plugin.
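As a rough illustration of the currentAbove approach, here is a minimal Python sketch that builds a render API URL wrapping the wildcard in currentAbove. The host graphite.example.com and the helper name build_render_url are hypothetical, and the threshold 0 only makes sense for metrics whose live values are above zero.

```python
from urllib.parse import urlencode


def build_render_url(base_url, target, time_from="-15min"):
    """Return a Graphite render API URL that requests JSON for one target."""
    query = urlencode({"target": target, "from": time_from, "format": "json"})
    return f"{base_url.rstrip('/')}/render?{query}"


# Hypothetical Graphite host; substitute your own.
url = build_render_url(
    "http://graphite.example.com",
    "currentAbove(servers.*.cpu, 0)",
)
print(url)
```

Note that this keeps only series whose last value in the window is above 0, so it does not help with gauges that stay at a non-zero value.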
Gauges are a particular problem, since once an AWS instance is terminated all gauge metrics from that instance will remain "stuck" at their last value.
Some ideas I had around this area:
Developing the idea of filtering out constant series, averageAbove(integral(nonNegativeDerivative(...)), 1) seems like a good start, but I can't work out how to display only the original series.
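One way around the "display only the original series" problem might be to do the filtering on the client instead. The sketch below assumes the data was fetched from the render API with format=json, which returns a list of objects with "target" and "datapoints" (pairs of value and timestamp). The function names and the sample data are made up for illustration.

```python
def has_changing_data(datapoints):
    """True if the series has at least two distinct non-null values."""
    values = {value for value, _timestamp in datapoints if value is not None}
    return len(values) > 1


def drop_stale_series(render_response):
    """Keep only series that are neither empty nor constant in the window."""
    return [s for s in render_response if has_changing_data(s["datapoints"])]


# Made-up sample in the shape of a render API JSON response.
sample = [
    {"target": "servers.1.cpu", "datapoints": [[None, 100], [None, 160]]},
    {"target": "servers.2.cpu", "datapoints": [[0.4, 100], [0.7, 160]]},
    {"target": "servers.3.cpu", "datapoints": [[5.0, 100], [5.0, 160]]},
]

kept = [s["target"] for s in drop_stale_series(sample)]
print(kept)  # ['servers.2.cpu']
```

The obvious caveat is that a live server reporting a value that really is constant over the window would be dropped too, so this only suits metrics that normally fluctuate.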
Upvotes: 0
Reputation: 2281
Graphite allocates all the space for a metric the first time it receives a single value. This results in a very inefficient representation for any system where the metrics are sparse, for example a system where the servers are highly dynamic. I settled on two possible solutions:
Use slot names for the metrics rather than the actual server identifiers (IPs). I really don't like this because you have to look up the server by its slot name before you can actually go to the server that generated the metrics.
Use InfluxDB instead. InfluxDB only stores the metrics that you actually provide, and queries only return data if there actually is data to return. This results in a compact representation that only shows you data for the metrics that actually had data during the time span that is queried.
Upvotes: 1