Stephen H. Anderson

Reputation: 1068

Is standard deviation (STDDEV) the right function for the job?

We wrote a monitoring system. This monitor is made up of agents. Each agent runs on a different server and monitors that specific server's resources (RAM, CPU, SQL Server status, replication status, free disk space, Internet access, specific business metrics, etc.).

The agents report every measure they take to a central database where these "observations" are stored.

For example, every few seconds an agent would store in the central database a specific business metric called "unprocessed_files" with its corresponding value:

(unprocessed_files, 41)

That value is constantly being written to our DB (among many others, as explained above).
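To make the setup concrete, a simplified sketch of an agent's reporting loop might look like the following (SQLite stands in for our central database here, and the table layout and metric name are just for illustration):

    # Minimal sketch of an agent's reporting loop. SQLite stands in for the
    # central database; table and column names are invented for illustration.
    import sqlite3
    import time

    def count_unprocessed_files():
        # Placeholder for the real measurement taken on the monitored server.
        return 41

    conn = sqlite3.connect("central_monitoring.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS observations (
                        metric   TEXT,
                        value    REAL,
                        taken_at REAL)""")

    for _ in range(3):  # the real agent would loop forever
        conn.execute(
            "INSERT INTO observations (metric, value, taken_at) VALUES (?, ?, ?)",
            ("unprocessed_files", count_unprocessed_files(), time.time()),
        )
        conn.commit()
        time.sleep(2)  # "every few seconds"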

We are now implementing a client application, a screen that displays the status of everything we monitor. So, how can we calculate what's a "normal" value and what's a wrong value?

For example, we know that if our servers are working correctly, unprocessed_files would always be close to 0, but maybe (we don't know yet) 45 is an acceptable value.

So the question is, should we use the Standard Deviation in order to know what the acceptable range of values is?

ACCEPTABLE_RANGE = AVG(value) +- STDDEV(value) ?

We would like to notify with a red color when something is not going well.
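For concreteness, the rule we have in mind would look roughly like this, computed over a window of recent observations (the sample values are invented):

    # Sketch of the proposed rule: flag a reading as "red" when it falls
    # outside AVG ± STDDEV of recent observations. The sample data is made up.
    from statistics import mean, stdev

    recent = [0, 0, 1, 0, 2, 0, 0, 3, 0, 0]   # hypothetical recent unprocessed_files values
    avg, sd = mean(recent), stdev(recent)
    low, high = avg - sd, avg + sd

    new_value = 45
    if not (low <= new_value <= high):
        print(f"RED: {new_value} outside [{low:.2f}, {high:.2f}]")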

Upvotes: 2

Views: 826

Answers (2)

O. Jones

Reputation: 108736

For your backlog (unprocessed file) metric, using a standard deviation to know when to sound an alarm (turn something red) is going to drive you crazy with false alarms.

Why? Most of the time your backlog will be zero, so the standard deviation will also be very close to zero. Standard deviation tells you how much your metric varies. Therefore, whenever you get a nonzero backlog, it will be outside the avg + stdev range.

For a backlog, you may want to turn stuff yellow when the value is > 1 and red when the value is > 10.
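To illustrate both points with made-up numbers (a mostly-zero backlog, and the fixed yellow/red thresholds):

    # Why avg ± stdev misfires on a mostly-zero backlog, and the simpler
    # fixed-threshold rule described above. Sample data is made up.
    from statistics import mean, stdev

    backlog = [0] * 95 + [1, 2, 1, 3, 1]       # mostly zero, as a healthy backlog is
    avg, sd = mean(backlog), stdev(backlog)
    print(f"avg={avg:.2f}, stdev={sd:.2f}")     # both close to zero
    print("a backlog of 1 is already 'out of range':", 1 > avg + sd)

    def backlog_status(value):
        if value > 10:
            return "red"
        if value > 1:
            return "yellow"
        return "green"

    print(backlog_status(0), backlog_status(5), backlog_status(42))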

If you have a "how long did it take" metric, standard deviation might be a valid way to identify alarm conditions. For example, you might have a web request that usually takes about half a second, but typically varies from 0.25 to 0.8 seconds. If requests suddenly start taking 2.5 seconds, you know something has gone wrong.
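A rough illustration with invented request times:

    # For a duration metric, avg ± k·stdev can be a reasonable alarm rule.
    # The request times below are made up to match the 0.25-0.8 s range.
    from statistics import mean, stdev

    request_times = [0.45, 0.5, 0.3, 0.25, 0.6, 0.8, 0.55, 0.4, 0.5, 0.65]
    avg, sd = mean(request_times), stdev(request_times)

    suspect = 2.5
    print(f"{suspect}s is {(suspect - avg) / sd:.1f} standard deviations above average")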

Standard deviation is a measurement that makes most sense for a normal distribution (bell curve distribution). When you handle your measurements as if they fit a bell curve, you're implicitly making the assumption that each measurement is entirely independent of the others. That assumption works poorly for typical metrics of a computing system (backlog, transaction time, load average, etc). So, using stdev is OK, but not great. You'll probably struggle to make sense of stdev numbers: that's because they don't actually make much sense.

You'd be better off, as @duffymo suggested, looking at the 95th percentile (the worst-performing operations). But MySQL doesn't compute those kinds of distributions natively. PostgreSQL does. So does Oracle Standard Edition and higher.
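If you're stuck on MySQL, one option is to pull the recent values out and compute the percentile client-side. A small sketch (the values stand in for a query result):

    # Computing the 95th percentile client-side when the database cannot.
    # The list is a stand-in for a "SELECT value FROM observations ..." result.
    from statistics import quantiles

    values = [0.3, 0.4, 0.5, 0.45, 0.6, 0.55, 0.7, 0.8, 0.35, 0.5,
              0.65, 0.4, 0.75, 0.5, 0.6, 0.45, 0.55, 0.5, 2.4, 0.6]

    p95 = quantiles(values, n=100)[94]   # 95th percentile (index 94 of the 99 cut points)
    print(f"95th percentile: {p95:.2f}s")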

How do you determine an out-of-bounds metric? It depends on the metric, and on what you're trying to do. If it's a backlog measurement and it grows from minute to minute, you have a problem to investigate. If it's a transaction time and it's far longer than average (avg + 3 × stdev, for example), you have a problem. The open source monitoring system Nagios has worked this out for various kinds of metrics.
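A sketch of the "backlog keeps growing" check (the window length and sample values are arbitrary):

    # Alarm when the metric has risen across every one of the last few samples.
    def backlog_growing(samples, window=5):
        recent = samples[-window:]
        return len(recent) == window and all(
            later > earlier for earlier, later in zip(recent, recent[1:])
        )

    print(backlog_growing([0, 0, 1, 3, 7, 12, 20]))   # True: steadily climbing
    print(backlog_growing([0, 4, 2, 0, 1, 0, 3]))     # False: bouncing around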

Read a book by N. N. Taleb called "The Black Swan" if you want to know how assuming the real world fits normal distributions can crash the global economy.

Upvotes: 2

David Z

Reputation: 131640

Standard deviation is just a way of characterizing how much a set of values spreads away from its average (i.e. mean). In a sense, it's an "average deviation from average", though a little more complicated than that. It is true that values which differ from the mean by many times the standard deviation tend to be rare, but that doesn't mean the standard deviation is a good benchmark for identifying anomalous values that might indicate something is wrong.

For one thing, if you set your acceptable range at the average plus or minus one standard deviation, you're probably going to get results outside that range very frequently! You could use the average plus or minus two standard deviations, or three, or however many you need to push the number of notifications/error conditions as low as you want, but there's no telling whether any of this actually helps you identify error conditions.

I think your main problem is not statistics. Your problem is that you don't know what kinds of results actually indicate an error. So before you program in any acceptable range, just let the system run for a while and collect some calibration data showing what kinds of values you see when it's running normally, and what kinds of values you see when it's not running normally. Make sure you have some way to tell which are which. Once you have a good amount of data for both conditions, you can analyze it (start with a simple histogram) and see what kinds of values are characteristic of normal operation and what kinds are characteristic of error conditions. Then you can set your acceptable range based on that.
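A first pass at that analysis might be nothing more than bucketing the two sets of values (all numbers here are invented):

    # Simple histograms of the metric under known-good and known-bad operation.
    from collections import Counter

    normal_values = [0, 0, 1, 0, 2, 0, 0, 1, 0, 3, 0, 0]
    faulty_values = [12, 25, 40, 33, 41, 55, 38, 47, 60, 52]

    def histogram(values, bucket=10):
        return Counter((v // bucket) * bucket for v in values)

    print("normal:", sorted(histogram(normal_values).items()))
    print("faulty:", sorted(histogram(faulty_values).items()))
    # A threshold that separates the two distributions becomes the boundary
    # of the "acceptable range".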

If you want to get fancy, there is a statistical technique called likelihood ratio testing that can help you evaluate just how likely it is that your system is working properly. But I think it's probably overkill. Monitoring systems don't need to be super-precise about this stuff; just show a cautionary notice whenever the readings start to seem abnormal.
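If you do want to experiment with it, a toy version of the idea is to fit a distribution to your "healthy" and "faulty" calibration data and ask which one explains a new reading better. Everything below (the data, the normal-distribution assumption) is purely illustrative:

    # Toy likelihood-ratio check: which fitted distribution (healthy vs. faulty)
    # explains a new reading better? Data and the Gaussian model are assumptions.
    from statistics import NormalDist

    healthy = NormalDist.from_samples([0.4, 0.5, 0.45, 0.6, 0.55, 0.5, 0.35, 0.5])
    faulty = NormalDist.from_samples([2.0, 2.5, 1.8, 2.2, 2.6, 2.4, 1.9, 2.3])

    reading = 2.1
    ratio = faulty.pdf(reading) / healthy.pdf(reading)
    print("faulty more likely" if ratio > 1 else "healthy more likely", f"(ratio={ratio:.3g})")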

Upvotes: 1
