Lee Jensen

Reputation: 2281

Don't display graphite metrics without data in the time range

I have a cluster of AWS servers whose statistics I track using Graphite. The servers in the cluster change as new versions of software are deployed or as the cluster grows or shrinks.

For example:

Metrics added yesterday
servers.1.cpu
servers.2.cpu

Metrics added today
servers.2.cpu
servers.3.cpu

When I view my data through Graphite, I only want it to show metrics for the servers that have data in the time period I am querying. However, because I don't know which servers were available in that period, I specify * in the query, and this causes every server that has ever existed in the cluster to be included in the series.

query last 15 minutes:
servers.*.cpu

Results in:
servers.1.cpu
servers.2.cpu
servers.3.cpu

Is there a way to filter out of the series the servers that have no data in the queried time range? In the example above, the result would then not include servers.1.cpu. If my query time period were yesterday instead, I would get servers.1.cpu but not servers.3.cpu.

Upvotes: 2

Views: 1108

Answers (2)

Richard Barnett

Reputation: 1108

We have a similar problem with metrics sent via statsd to Graphite. In some cases I've been able to use currentAbove(0) to filter out series without "interesting" values; this was successful for values from the collectd load plugin.
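
As a rough illustration, the currentAbove(0) approach might be applied through the render API as sketched below. The host graphite.example.com and the helper names are made up for the example, and drop_empty_series is a separate client-side fallback (not a Graphite function) that discards series with no non-null datapoints in the queried window.

```python
from urllib.parse import urlencode


def render_url(base_url, target, time_from="-15min"):
    """Build a render API URL that asks for JSON output."""
    query = urlencode({"target": target, "from": time_from, "format": "json"})
    return f"{base_url}/render?{query}"


def drop_empty_series(series_list):
    """Keep only series with at least one non-null value in the window."""
    return [
        series
        for series in series_list
        if any(value is not None for value, _ts in series["datapoints"])
    ]


# currentAbove(0) wrapped around the wildcard from the question.
# The host name is a placeholder, not a real server.
url = render_url("http://graphite.example.com", "currentAbove(servers.*.cpu,0)")

# Hand-written sample in the shape the render API returns with format=json:
# each datapoint is a [value, timestamp] pair, with null for missing values.
sample = [
    {"target": "servers.1.cpu", "datapoints": [[None, 1000], [None, 1060]]},
    {"target": "servers.2.cpu", "datapoints": [[0.4, 1000], [0.5, 1060]]},
    {"target": "servers.3.cpu", "datapoints": [[None, 1000], [0.7, 1060]]},
]
kept = [series["target"] for series in drop_empty_series(sample)]
```

With the sample above, only servers.2.cpu and servers.3.cpu survive the client-side filter. Neither approach helps with gauges that keep reporting a stale value, which is the problem described next.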

Gauges are a particular problem, since once an AWS instance is terminated all gauge metrics from that instance will remain "stuck" at their last value.

Some ideas I had around this area:

  • Use CloudWatch Events to set all gauges for terminated instances to 0; unfortunately, Graphite's poor search API would make this a bit challenging
  • Add custom functions to Graphite, e.g. it would be fairly easy to write a function that filters out any series whose first & last values are the same
  • We're using Grafana so we could add a scripted dashboard where the script fetches the current AWS hostnames (from the Salt master, handwavey handwavey) & dynamically populates the series for the dashboard.

Developing the idea of filtering out constant series, averageAbove(integral(nonNegativeDerivative(...)), 1) seems like a good start but I can't work out how to display only the original series.
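
One way around the "display only the original series" problem might be to do the constant-series check outside Graphite, on the JSON returned by the render API, and then plot whatever is left. The sketch below assumes that approach; the function names are hypothetical and nothing here is a built-in Graphite function.

```python
def is_constant(datapoints):
    """True if all non-null values in the window are identical (or absent)."""
    values = [value for value, _ts in datapoints if value is not None]
    return len(set(values)) <= 1


def drop_constant_series(series_list):
    """Return the original series, minus those that never change."""
    return [s for s in series_list if not is_constant(s["datapoints"])]


# Hand-written sample: servers.1 is a gauge "stuck" at its last value.
sample = [
    {"target": "servers.1.gauge", "datapoints": [[7.0, 1000], [7.0, 1060], [7.0, 1120]]},
    {"target": "servers.2.gauge", "datapoints": [[3.0, 1000], [None, 1060], [4.5, 1120]]},
]
live = [s["target"] for s in drop_constant_series(sample)]
```

The obvious caveat is that a healthy server whose gauge really is flat for the whole window gets dropped as well, so this only suits metrics that normally fluctuate.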

Upvotes: 0

Lee Jensen

Reputation: 2281

Graphite allocates all the space for a metric the first time it receives a single value. This results in a very inefficient representation for any system where the metrics are sparse, for example a system where the servers are highly dynamic. I settled on two possible solutions:

  1. Use slot names for the metrics rather than the actual server identifiers (IPs). I really don't like this because it causes you to look up the server using the slot name before you can actually go to the server that generated the metrics.

  2. Use InfluxDB instead. InfluxDB only stores the metrics that you actually provide, and queries only return data if there actually is data to return. This results in a compact representation that only shows you data for the metrics that actually had data during the time span that is queried.
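
To put a rough number on the preallocation point, here is a back-of-the-envelope sketch. It assumes Graphite's default Whisper backend with its fixed-size on-disk layout (a 16-byte header, 12 bytes of info per archive, 12 bytes per datapoint), and the retention schema used is purely an illustrative assumption, not one taken from the setup above.

```python
METADATA_BYTES = 16      # aggregation type, max retention, xFilesFactor, archive count
ARCHIVE_INFO_BYTES = 12  # offset, seconds per point, number of points
POINT_BYTES = 12         # 4-byte timestamp + 8-byte double value


def whisper_file_size(archives):
    """Size in bytes of a Whisper file.

    archives: list of (seconds_per_point, retention_seconds) tuples.
    """
    points = sum(retention // step for step, retention in archives)
    header = METADATA_BYTES + ARCHIVE_INFO_BYTES * len(archives)
    return header + POINT_BYTES * points


# Hypothetical schema: 10-second points for 6 hours, 1-minute points for 7 days.
archives = [(10, 6 * 3600), (60, 7 * 24 * 3600)]
size = whisper_file_size(archives)  # 146920 bytes for a single metric
```

Under those assumptions, every metric costs roughly 143 KiB on disk as soon as its first value arrives, whether the server behind it lives for a year or for ten minutes.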

Upvotes: 1
