Alek Storm

Reputation: 777

Break down Datadog COUNT/GAUGE without double-counting

I have a script that queries the API of our CI system (Buildkite) once per minute, fetches details of all build agents, and emits metrics to Datadog for analysis. Getting an accurate count of these agents in the Datadog UI has proven challenging, however.

If the script emits a COUNT metric for each agent it sees, then agents are double-counted in the Datadog UI whenever the graph's interval is longer than a minute, because the script runs once per minute and sees (mostly) the same agents each time. The script could instead total up the agents it sees on each run and emit that as a single GAUGE, but then I lose the ability to break down the count in the Datadog UI by agent-specific tags (queue, etc).

I suppose I could emit a GAUGE with a value of 1 for each agent on each run, add an artificial index tag holding the agent's numeric index in the agent array, and rely on the Datadog UI to sum across index values? I could use the agent ID/host instead, of course, but Datadog charges by number of tag values and our agents are in an auto-scaling group, so hosts change frequently.

This seems hacky - is there a better solution? Am I overthinking this?

Upvotes: 1

Views: 2551

Answers (1)

draav

Reputation: 1953

You could tag your metrics with the name or ID of the agent they describe (if you aren't already). Then in Datadog you could write a query that groups by the agent ID and applies the count_not_null function: https://docs.datadoghq.com/dashboards/functions/count/

This basically hijacks an arbitrary metric: it counts the unique agents reporting that metric and treats that as the total number of agents. You wouldn't be able to easily group by queue though, so I don't know if it would be a good solution for your use case.

Your idea of using gauges sounds good to me. You can send a new metric called something like myagent.running with a value of 1 for each of your agents, then sum all of those gauges in the Datadog query to get a count. That is actually how the metric datadog.agent.running is implemented: https://docs.datadoghq.com/integrations/agent_metrics/#metrics
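
Here is a rough sketch of that gauge approach in Python, in case it helps. Everything in it is an assumption for illustration: the `build_agent_gauges` and `emit` helpers are made up, the agent dicts are assumed to carry a `queue` key (the real Buildkite response is shaped differently), and the index is counted per queue rather than across the whole agent array, which is a small variation on the idea in the question meant to keep the number of distinct tag combinations down. The index tag is there because gauges sharing a name and tag set within one flush interval would otherwise overwrite each other.

```python
from collections import Counter

METRIC_NAME = "myagent.running"  # hypothetical metric name from the answer


def build_agent_gauges(agents, metric=METRIC_NAME):
    """Return one (name, value, tags) gauge point per agent.

    Each point has value 1 and is tagged with the agent's queue plus an
    artificial index, so points in the same queue stay distinct.
    """
    seen_per_queue = Counter()
    points = []
    for agent in agents:
        # Assumption: each agent is a dict with a "queue" key.
        queue = agent.get("queue", "unknown")
        index = seen_per_queue[queue]
        seen_per_queue[queue] += 1
        points.append((metric, 1, [f"queue:{queue}", f"index:{index}"]))
    return points


def emit(points, gauge):
    """Send each point through a gauge(name, value, tags=...) callable."""
    for name, value, tags in points:
        gauge(name, value, tags=tags)


if __name__ == "__main__":
    sample_agents = [{"queue": "default"}, {"queue": "deploy"}, {"queue": "default"}]
    # Print instead of sending; swap in a real client's gauge function to ship metrics.
    emit(build_agent_gauges(sample_agents), lambda n, v, tags: print(n, v, tags))
```

With the official `datadog` Python package you could presumably pass `statsd.gauge` as the `gauge` argument. In the UI, a query along the lines of `sum:myagent.running{*} by {queue}` should then give the per-queue count without double-counting over longer intervals, since gauges are averaged over time rather than added up.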

Upvotes: 1
