Nodelay Heehoo

Reputation: 23

Prometheus query for continuous uptime

I'm a Prometheus newbie and have been trying to figure out the right query to get the last continuous uptime for my service.

For example, if the present time is 0:01:20 and my service came up at 0:00:00, went down at 0:01:01, and came up again at 0:01:10, I'd like to see an uptime of "10 seconds".

I've mainly been looking at the "up{}" metric, possibly combined with functions such as changes() and rate(), but have had no luck so far. I don't see any other Prometheus metric similar to "up" either.

Upvotes: 0

Views: 20902

Answers (2)

valyala

Reputation: 17860

The following PromQL query can be used to calculate the application uptime in seconds:

time() - process_start_time_seconds

This query works for all applications written in Go that use either the github.com/prometheus/client_golang or the github.com/VictoriaMetrics/metrics client library; both expose the process_start_time_seconds metric by default. This metric contains the Unix timestamp of the application start time.
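
As a rough illustration of what this subtraction does, the sketch below parses a text-format metrics snippet and computes the uptime by hand. The function name uptime_seconds, the sample payload, and the timestamps are made up for the example; they are not part of any Prometheus library, and the parser only handles a label-less metric line.

```python
def uptime_seconds(exposition_text: str, now: float) -> float:
    """Return now - process_start_time_seconds from a text-format metrics payload."""
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition(" ")
        if name == "process_start_time_seconds":
            return now - float(value)
    raise ValueError("process_start_time_seconds not found")


# Hypothetical payload: the process started at Unix time 1700000000.
SAMPLE = """\
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.7e+09
"""

# Pretend "now" is 80 seconds after the start, playing the role of time().
print(uptime_seconds(SAMPLE, now=1700000080.0))
```

Because the metric is reset whenever the process restarts, the result is the length of the current run rather than a total over some window.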

Kubernetes exposes the container_start_time_seconds metric for each started container by default. So the following query can be used for tracking uptimes for containers in Kubernetes:

time() - container_start_time_seconds{container!~"POD|"}

The container!~"POD|" filter is needed to exclude auxiliary time series:

  • Time series with the container="POD" label reflect e.g. pause containers - see this answer for details.
  • Time series without a container label correspond to e.g. the cgroups hierarchy. See this answer for details.
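
To see why a single pattern covers both cases, here is a small sketch that mimics the matcher in plain Python. It relies on two facts about Prometheus label matchers: regexes are fully anchored, and a missing label behaves like an empty value. The helper name keep_series and the sample label sets are hypothetical.

```python
import re

# PromQL regex matchers are fully anchored, so "POD|" means
# "exactly POD" or "exactly the empty string".
AUX_CONTAINER = re.compile(r"POD|")


def keep_series(labels: dict) -> bool:
    """Mimic container!~"POD|": drop pause-container and label-less series."""
    # A missing label is treated like an empty value by Prometheus matchers.
    container = labels.get("container", "")
    return AUX_CONTAINER.fullmatch(container) is None


# Hypothetical label sets for three series of container_start_time_seconds.
series = [
    {"container": "POD", "pod": "web-1"},    # pause container: dropped
    {"pod": "web-1"},                        # no container label: dropped
    {"container": "nginx", "pod": "web-1"},  # real container: kept
]
print([s for s in series if keep_series(s)])
```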

If you need to calculate the overall per-target uptime over a given time range, you can estimate it with the up metric. Prometheus automatically generates the up metric for each scrape target, setting it to 1 on each successful scrape and to 0 otherwise. See these docs for details. The following query estimates the total uptime in seconds for each scrape target during the last 24 hours:

avg_over_time(up[24h]) * (24*3600)

See avg_over_time docs for details.
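
The arithmetic behind this estimate can be sketched in a few lines of Python. The function name and the sample data are invented for illustration. The result is only an estimate because it weights every scrape equally and cannot see what happened between scrapes.

```python
def estimated_uptime_seconds(up_samples, window_seconds):
    """Mimic avg_over_time(up[window]) * window_seconds for one target."""
    if not up_samples:
        raise ValueError("no samples in the window")
    # The mean of 0/1 samples is the fraction of successful scrapes.
    return sum(up_samples) / len(up_samples) * window_seconds


# Hypothetical target scraped once a minute for 24 hours (1440 samples),
# with 360 failed scrapes somewhere in the window.
samples = [1] * 1080 + [0] * 360
print(estimated_uptime_seconds(samples, 24 * 3600))  # 0.75 * 86400 = 64800.0
```

Note that this gives the total time up within the window, not the length of the most recent uninterrupted run.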

Upvotes: 0

Elad Amit

Reputation: 605

The problem is that you need something that tells you when your service was actually up, as opposed to whether the node was up :)
We use the following (I hope one of them, or the general idea behind each, will help):
1. When we look at a host we use node_time{...} - node_boot_time{...}
2. When we look at a specific process / container (docker via cadvisor in our case) we use node_time{...} - on(instance) group_right container_start_time_seconds{name=~"..."}) by(name,instance)

Upvotes: 7
