dotslashlu

Reputation: 3401

How to use Prometheus to alert on a specific error message?

I'm trying to collect an application's running status and, if an error happens, use Alertmanager to send an alert.

I read the docs about metric types, and a gauge vec seems to be the only suitable type. Currently my metric definition looks like this (it's in Go, but you can get the idea):

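// Gauge that is set to 1 for a module once validation has failed.
errored = prometheus.NewGaugeVec(
    prometheus.GaugeOpts{
        Name: "validate_errored",
        Help: "Set to 1 when validation of a module has failed.",
    },
    []string{"module"},
)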

The value 1 is reported once an error has happened, and Alertmanager is configured to alert when validate_errored becomes 1.

But now I need to know the exact error in the alert message, so I decided to add a new label:

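// Same gauge, now with an extra "error" label holding the error message.
errored = prometheus.NewGaugeVec(
    prometheus.GaugeOpts{
        Name: "validate_errored",
        Help: "Set to 1 when validation of a module has failed.",
    },
    []string{"module", "error"},
)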

Errors are alerted successfully, but the problem with this approach is that Prometheus seems to treat each unique label combination separately when querying, so every distinct error message becomes its own line on the chart.

I have also read, in a source I can no longer recall, that using labels to hold variable data can be a problem.

So what is the idiomatic way to alert on a specific error?

Upvotes: 2

Views: 8101

Answers (1)

ahus1

Reputation: 5932

Reading your question, I assume that once an error occurs the metric stays at "1" until the application is restarted, or that the status might be reset once a user has cleared the condition.

If this is a status that will later be cleared, a Gauge is the right type to use. If you want to report or alert on how many errors (and of which type) occur, a Counter might be more suitable.

Prometheus is a good tool for recording and alerting on metrics (and status) information.

If you want to alert on events (the fact that an error occurred), something like a log management solution might be more suitable. A log can also provide more in-depth information about what happened.

You can add the error as a label as long as there is no "metrics explosion". If the number of error types is reasonably low, you can add it as a label. Something like a user ID (with an unlimited number of values) should not be used as a label, as it would result in a metrics explosion. This is also stated in the Prometheus docs.
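One way to keep the number of label values low is to map each error onto a small, fixed set of categories before it is used as a label value, instead of passing the raw message. The following is only a sketch under assumptions: `errorKind`, `ErrInvalidSchema`, the example errors, and the category names are all hypothetical and not from the question, and your application will have its own error types.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"io/fs"
)

// ErrInvalidSchema is a hypothetical sentinel error used only in this sketch.
var ErrInvalidSchema = errors.New("invalid schema")

// Example inputs for the demo in main. The messages contain variable data
// (a user ID, an IP address) on purpose.
var (
	exampleWrapped = fmt.Errorf("validating user 4711: %w", ErrInvalidSchema)
	exampleUnknown = errors.New("connection reset by peer 10.0.0.7")
)

// errorKind maps an arbitrary error onto a fixed set of label values, so the
// number of time series stays bounded no matter what the message contains.
func errorKind(err error) string {
	switch {
	case err == nil:
		return "none"
	case errors.Is(err, context.DeadlineExceeded):
		return "timeout"
	case errors.Is(err, fs.ErrNotExist):
		return "not_found"
	case errors.Is(err, ErrInvalidSchema):
		return "invalid_schema"
	default:
		// Anything unrecognized shares one bucket instead of creating a new series.
		return "other"
	}
}

func main() {
	fmt.Println(errorKind(exampleWrapped)) // invalid_schema
	fmt.Println(errorKind(exampleUnknown)) // other
}
```

With the gauge from the question, the result of `errorKind(err)` rather than `err.Error()` would then be passed as the value of the `error` label. The full message, including the user ID or address, is better kept in the logs.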

Adding a label to be more specific about when to send an alert is usually a good thing. Adding a label only to show it in the alert message is technically feasible, but it is not the best reason to add a label, as it creates an additional time series for each label value (IMHO).

Upvotes: 1
