erfan mehraban

Reputation: 494

Prometheus query for last local peak value

What Prometheus query (PromQL) can be used to identify the last local peak value within the last X minutes of a graph?

A local peak is a point that is larger than both its previous and its next data point. (So the most recent sample can never be a local peak.)

Sample graph (p: peak point, i: cron job interval, m: missed execution)

I want this value to detect anomalies in the execution of a cron job. As you can see in the picture, I have written a query that calculates the time elapsed since the last execution of a job. To write an alert rule that detects a missed execution, I need the value the elapsed time had reached when the job last ran, that is, the last peak. The job's interval is unknown to the query (it is specified by another program), so I cannot compare the elapsed time with a fixed duration.

Upvotes: 2

Views: 1769

Answers (2)

Amin

Reputation: 643

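If what you want is an alert that fires when the elapsed time exceeds a fixed duration, you can write an alert similar to the common up alert. It is based on a changes(...) > 0 expression, which is true only while the job is running; the rule below inverts the comparison (== 0) and uses for: so that it fires once nothing has changed for the whole alert duration.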

An example would be:

  rules:
  - alert: CronJobNotRunning
    expr: |
        changes(
            sum(
                rate(
                    cronjob_duration_time_seconds_count{
                        status="ok", namespace="<namespace>", exported_job="<job>"
                    }[1m]
                )
            )[1m:]
        ) == 0
    for: <alert_duration>

Note that subqueries ([1m:]) are expensive, and introducing a recording rule there can help performance, especially in a dashboard.
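As a rough sketch of that recording-rule idea, the inner aggregation could be stored under its own series name so that the alert reads it through a plain range selector instead of a subquery. The group name and the recorded series name below are made up for illustration, and the placeholders are carried over from the rule above.

```yaml
groups:
  - name: cronjob_rules  # hypothetical group name
    rules:
      # Precompute the inner aggregation once per rule evaluation.
      # The recorded series name is an assumption, not an existing metric.
      - record: job:cronjob_duration_time_seconds_count:rate1m
        expr: |
          sum(
            rate(
              cronjob_duration_time_seconds_count{
                status="ok", namespace="<namespace>", exported_job="<job>"
              }[1m]
            )
          )
      # Same condition as before, but over stored samples (no subquery).
      - alert: CronJobNotRunning
        expr: changes(job:cronjob_duration_time_seconds_count:rate1m[1m]) == 0
        for: <alert_duration>
```

The [1m] range only sees samples written by the recording rule, so it probably needs to span at least two rule evaluation intervals for changes() to have anything to compare.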

Also, in your case, you could use the time since the second derivative was last non-zero, because that happens when a job starts or finishes (the drops in the graph, or the points where it starts to rise).
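A minimal sketch of the second-derivative idea, under several assumptions: cronjob_elapsed_seconds is a hypothetical name for the elapsed-time series plotted in the question, the 2m and 1h windows are arbitrary, and the 0.01 tolerance is a guess to absorb floating-point noise from deriv(). It has not been tuned against real data.

```yaml
groups:
  - name: cronjob_slope_rules  # hypothetical group name
    rules:
      # Seconds since the slope of the elapsed-time series last changed.
      # cronjob_elapsed_seconds is an assumed metric name.
      - record: cronjob:elapsed_seconds:since_slope_change
        expr: |
          time() - max_over_time(
            timestamp(
              # second derivative: deriv() applied to a subquery of deriv()
              abs(deriv(deriv(cronjob_elapsed_seconds[2m])[2m:])) > 0.01
            )[1h:]
          )
```

Here timestamp() is applied to a computed vector, so it yields the evaluation time of each subquery step where the filter matched, and max_over_time() keeps the most recent one. The nested subqueries make this expensive to evaluate.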

Upvotes: 0

Vahid Alamfard

Reputation: 46

Use a z-score to detect anomalies

If you know the mean and standard deviation (σ) of a series, you can calculate the z-score of any sample in it. The z-score expresses how many standard deviations the sample lies from the mean. In a data set with a normal distribution, a z-score of 0 means the sample is identical to the mean, a z-score of 1 means it is 1.0 σ above the mean, and so on.

  1. Calculate the average and standard deviation for the metric, using data with a large sample size.

  # Short-term rate that the long-term rules and the z-score build on
  - record: job:cronjob_duration_time_seconds_count:rate10m
    expr: sum(rate(cronjob_duration_time_seconds_count[10m]))

  # Long-term average value for the series
  - record: job:cronjob_duration_time_seconds_count:rate10m:avg_over_time_1w
    expr: avg_over_time(sum(rate(cronjob_duration_time_seconds_count[10m]))[1w:])

  # Long-term standard deviation for the series
  - record: job:cronjob_duration_time_seconds_count:rate10m:stddev_over_time_1w
    expr: stddev_over_time(sum(rate(cronjob_duration_time_seconds_count[10m]))[1w:])

  2. Calculate the z-score for the Prometheus query once you have the average and standard deviation for the aggregation.

  # Z-score for the aggregation
  (
    job:cronjob_duration_time_seconds_count:rate10m -
    job:cronjob_duration_time_seconds_count:rate10m:avg_over_time_1w
  ) / job:cronjob_duration_time_seconds_count:rate10m:stddev_over_time_1w

In a normal distribution, about 99.7% of samples fall within three standard deviations of the mean, so you can treat any z-score outside the range of roughly -3 to +3 as an anomaly. For example, you can fire an alert when the aggregation stays outside this range for more than five minutes.
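Such an alert might look like the sketch below. The group and alert names are invented for illustration, and the threshold of 3 is an assumption to tune for your data. The mean and standard deviation are written out inline so the rule stands on its own; substituting recorded series for them would be cheaper to evaluate.

```yaml
groups:
  - name: cronjob_anomaly  # hypothetical group name
    rules:
      - alert: CronJobRateAnomaly  # hypothetical alert name
        expr: |
          abs(
            (
              # current value of the aggregation
              sum(rate(cronjob_duration_time_seconds_count[10m]))
              # minus its one-week mean
              - avg_over_time(sum(rate(cronjob_duration_time_seconds_count[10m]))[1w:])
            )
            # divided by its one-week standard deviation
            / stddev_over_time(sum(rate(cronjob_duration_time_seconds_count[10m]))[1w:])
          ) > 3  # assumed z-score threshold
        for: 5m
```

Wrapping the z-score in abs() lets one rule cover both directions, that is, values far above and far below the mean.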

Upvotes: 3
