Stephen H. Anderson

Reputation: 1068

Is standard deviation (STDDEV) the right function for the job?

We wrote a monitoring system. This monitor is made up of agents. Each agent runs on a different server and monitors that specific server's resources (RAM, CPU, SQL Server status, replication status, free disk space, Internet access, specific business metrics, etc.).

The agents report every measure they take to a central database where these "observations" are stored.

For example, every few seconds an agent would store in the central database a specific business metric called "unprocessed_files" with its corresponding value:

(unprocessed_files, 41)

That value is constantly being written to our DB (among many others, as explained above).
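To make the setup concrete, a simplified sketch of an agent's reporting loop might look like the following (SQLite stands in for our central database here, and the table layout and metric name are just for illustration):

    # Minimal sketch of an agent's reporting loop. SQLite stands in for the
    # central database; table and column names are invented for illustration.
    import sqlite3
    import time

    def count_unprocessed_files():
        # Placeholder for the real measurement taken on the monitored server.
        return 41

    conn = sqlite3.connect("central_monitoring.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS observations (
                        metric   TEXT,
                        value    REAL,
                        taken_at REAL)""")

    for _ in range(3):  # the real agent would loop forever
        conn.execute(
            "INSERT INTO observations (metric, value, taken_at) VALUES (?, ?, ?)",
            ("unprocessed_files", count_unprocessed_files(), time.time()),
        )
        conn.commit()
        time.sleep(2)  # "every few seconds"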

We are now implementing a client application, a screen that displays the status of everything we monitor. So, how can we calculate what's a "normal" value and what's a wrong value?

For example, we know that if our servers are working correctly, unprocessed_files would always be close to 0, but maybe (we don't know yet) 45 is an acceptable value.

So the question is, should we use the Standard Deviation in order to know what the acceptable range of values is?

ACCEPTABLE_RANGE = AVG(value) +- STDDEV(value) ?

We would like to notify with a red color when something is not going well.
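For concreteness, the rule we have in mind would look roughly like this, computed over a window of recent observations (the sample values are invented):

    # Sketch of the proposed rule: flag a reading as "red" when it falls
    # outside AVG ± STDDEV of recent observations. The sample data is made up.
    from statistics import mean, stdev

    recent = [0, 0, 1, 0, 2, 0, 0, 3, 0, 0]   # hypothetical recent unprocessed_files values
    avg, sd = mean(recent), stdev(recent)
    low, high = avg - sd, avg + sd

    new_value = 45
    if not (low <= new_value <= high):
        print(f"RED: {new_value} outside [{low:.2f}, {high:.2f}]")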

Upvotes: 2

Views: 826

Answers (2)

O. Jones

Reputation: 108736

For your backlog (unprocessed file) metric, using a standard deviation to know when to sound an alarm (turn something red) is going to drive you crazy with false alarms.

Why? Most of the time your backlog will be zero, so the standard deviation will also be very close to zero. Standard deviation tells you how much your metric varies. Therefore, whenever you get a nonzero backlog, it will be outside the avg + stdev range.

For a backlog, you may want to turn stuff yellow when the value is > 1 and red when the value is > 10.
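To illustrate both points with made-up numbers (a mostly-zero backlog, and the fixed yellow/red thresholds):

    # Why avg ± stdev misfires on a mostly-zero backlog, and the simpler
    # fixed-threshold rule described above. Sample data is made up.
    from statistics import mean, stdev

    backlog = [0] * 95 + [1, 2, 1, 3, 1]       # mostly zero, as a healthy backlog is
    avg, sd = mean(backlog), stdev(backlog)
    print(f"avg={avg:.2f}, stdev={sd:.2f}")     # both close to zero
    print("a backlog of 1 is already 'out of range':", 1 > avg + sd)

    def backlog_status(value):
        if value > 10:
            return "red"
        if value > 1:
            return "yellow"
        return "green"

    print(backlog_status(0), backlog_status(5), backlog_status(42))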

If you have a "how long did it take" metric, standard deviation might be a valid way to identify alarm conditions. For example, you might have a web request that usually takes about half a second, but typically varies from 0.25 to 0.8 seconds. If requests suddenly start taking 2.5 seconds, you know something has gone wrong.
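A rough illustration with invented request times:

    # For a duration metric, avg ± k·stdev can be a reasonable alarm rule.
    # The request times below are made up to match the 0.25-0.8 s range.
    from statistics import mean, stdev

    request_times = [0.45, 0.5, 0.3, 0.25, 0.6, 0.8, 0.55, 0.4, 0.5, 0.65]
    avg, sd = mean(request_times), stdev(request_times)

    suspect = 2.5
    print(f"{suspect}s is {(suspect - avg) / sd:.1f} standard deviations above average")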

Standard deviation is a measurement that makes most sense for a normal distribution (bell curve distribution). When you handle your measurements as if they fit a bell curve, you're implicitly making the assumption that each measurement is entirely independent of the others. That assumption works poorly for typical metrics of a computing system (backlog, transaction time, load average, etc). So, using stdev is OK, but not great. You'll probably struggle to make sense of stdev numbers: that's because they don't actually make much sense.

You'd be better off, as @duffymo suggested, looking at the 95th percentile (the worst-performing operations). But MySQL doesn't compute those kinds of distributions natively. PostgreSQL does. So does Oracle Standard Edition and higher.
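If you're stuck on MySQL, one option is to pull the recent values out and compute the percentile client-side. A small sketch (the values stand in for a query result):

    # Computing the 95th percentile client-side when the database cannot.
    # The list is a stand-in for a "SELECT value FROM observations ..." result.
    from statistics import quantiles

    values = [0.3, 0.4, 0.5, 0.45, 0.6, 0.55, 0.7, 0.8, 0.35, 0.5,
              0.65, 0.4, 0.75, 0.5, 0.6, 0.45, 0.55, 0.5, 2.4, 0.6]

    p95 = quantiles(values, n=100)[94]   # 95th percentile (index 94 of the 99 cut points)
    print(f"95th percentile: {p95:.2f}s")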

How do you determine an out-of-bounds metric? It depends on the metric, and on what you're trying to do. If it's a backlog measurement and it grows from minute to minute, you have a problem to investigate. If it's a transaction time and it's far longer than average (avg + 3 × stdev, for example), you have a problem. The open source monitoring system Nagios has worked this out for various kinds of metrics.
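A sketch of the "backlog keeps growing" check (the window length and sample values are arbitrary):

    # Alarm when the metric has risen across every one of the last few samples.
    def backlog_growing(samples, window=5):
        recent = samples[-window:]
        return len(recent) == window and all(
            later > earlier for earlier, later in zip(recent, recent[1:])
        )

    print(backlog_growing([0, 0, 1, 3, 7, 12, 20]))   # True: steadily climbing
    print(backlog_growing([0, 4, 2, 0, 1, 0, 3]))     # False: bouncing around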

Read a book by N. N. Taleb called "The Black Swan" if you want to know how assuming the real world fits normal distributions can crash the global economy.

Upvotes: 2

David Z

Reputation: 131640

Standard deviation is just a way of characterizing how much a set of values spreads away from its average (i.e. mean). In a sense, it's an "average deviation from average", though a little more complicated than that. It is true that values which differ from the mean by many times the standard deviation tend to be rare, but that doesn't mean the standard deviation is a good benchmark for identifying anomalous values that might indicate something is wrong.

For one thing, if you set your acceptable range at the average plus or minus one standard deviation, you're probably going to get results outside that range very frequently! You could use the average plus or minus two standard deviations, or three, or however many you need to push the number of notifications/error conditions as low as you want, but there's no telling whether any of this actually helps you identify error conditions.

I think your main problem is not statistics. Your problem is that you don't know what kinds of results actually indicate an error. So before you program in any acceptable range, just let the system run for a while and collect some calibration data showing what kinds of values you see when it's running normally, and what kinds of values you see when it's not running normally. Make sure you have some way to tell which are which. Once you have a good amount of data for both conditions, you can analyze it (start with a simple histogram) and see what kinds of values are characteristic of normal operation and what kinds are characteristic of error conditions. Then you can set your acceptable range based on that.
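A first pass at that analysis might be nothing more than bucketing the two sets of values (all numbers here are invented):

    # Simple histograms of the metric under known-good and known-bad operation.
    from collections import Counter

    normal_values = [0, 0, 1, 0, 2, 0, 0, 1, 0, 3, 0, 0]
    faulty_values = [12, 25, 40, 33, 41, 55, 38, 47, 60, 52]

    def histogram(values, bucket=10):
        return Counter((v // bucket) * bucket for v in values)

    print("normal:", sorted(histogram(normal_values).items()))
    print("faulty:", sorted(histogram(faulty_values).items()))
    # A threshold that separates the two distributions becomes the boundary
    # of the "acceptable range".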

If you want to get fancy, there is a statistical technique called likelihood ratio testing that can help you evaluate just how likely it is that your system is working properly. But I think it's probably overkill. Monitoring systems don't need to be super-precise about this stuff; just show a cautionary notice whenever the readings start to seem abnormal.
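If you do want to experiment with it, a toy version of the idea is to fit a distribution to your "healthy" and "faulty" calibration data and ask which one explains a new reading better. Everything below (the data, the normal-distribution assumption) is purely illustrative:

    # Toy likelihood-ratio check: which fitted distribution (healthy vs. faulty)
    # explains a new reading better? Data and the Gaussian model are assumptions.
    from statistics import NormalDist

    healthy = NormalDist.from_samples([0.4, 0.5, 0.45, 0.6, 0.55, 0.5, 0.35, 0.5])
    faulty = NormalDist.from_samples([2.0, 2.5, 1.8, 2.2, 2.6, 2.4, 1.9, 2.3])

    reading = 2.1
    ratio = faulty.pdf(reading) / healthy.pdf(reading)
    print("faulty more likely" if ratio > 1 else "healthy more likely", f"(ratio={ratio:.3g})")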

Upvotes: 1
