Reputation: 465
I have a python script that creates a list of lists of server uptime and performance data, where each sub-list (or 'row') contains a particular cluster's stats. For example, nicely formatted it looks something like this:
------- ------------- ------------ ---------- -------------------
Cluster %Availability Requests/Sec Errors/Sec %Memory_Utilization
------- ------------- ------------ ---------- -------------------
ams-a 98.099 1012 678 91
bos-a 98.099 1111 12 91
bos-b 55.123 1513 576 22
lax-a 99.110 988 10 89
pdx-a 98.123 1121 11 90
ord-b 75.005 1301 123 100
sjc-a 99.020 1000 10 88
...(so on)...
So in list form, it might look like:
[['ams-a', 98.099, 1012, 678, 91], ['bos-a', 98.099, 1111, 12, 91], ...]
My question: What's the best way to determine the outliers in each column? Or are outliers not necessarily the best way to attack the problem of finding 'badness'? In the data above, I'd definitely want to know about bos-b and ord-b, as well as ams-a since its error rate is so high, but the others can be discarded. Since higher isn't necessarily worse, nor is lower, depending on the column, I'm trying to figure out the most efficient way to do this. Numpy seems to get mentioned a lot for this sort of thing, but I'm not sure where to even start with it (sadly, I'm more sysadmin than statistician...).
Thanks in advance!
Upvotes: 17
Views: 18573
Reputation: 8736
I think your best bet is to have a look at scipy's scoreatpercentile function (in scipy.stats). For instance, you could try excluding all the values that fall above the 99th percentile.
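As a rough sketch on a single column (the 99th-percentile cutoff and the sample values, taken from the question's table, are just illustrative):

    from scipy import stats

    # One column of the data, e.g. Errors/Sec for each cluster
    errors_per_sec = [678, 12, 576, 10, 11, 123, 10]

    # Score below which 99% of the observations fall
    cutoff = stats.scoreatpercentile(errors_per_sec, 99)

    # Anything above the cutoff is a candidate for investigation
    suspects = [x for x in errors_per_sec if x > cutoff]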
Mean and standard deviation are no good if you don't have a normal distribution.
Generally it's good to have a rough visual idea of what your data looks like. There is matplotlib; I recommend you make some plots of your data with it before deciding on a plan.
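For example, a quick sketch (column values taken from the question's table):

    import matplotlib.pyplot as plt

    # %Availability for each cluster, from the example data
    availability = [98.099, 98.099, 55.123, 99.110, 98.123, 75.005, 99.020]

    # A box plot (or plt.hist for a histogram) makes the low clusters obvious
    plt.boxplot(availability)
    plt.ylabel('%Availability')
    plt.show()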
Upvotes: 5
Reputation: 20282
One good way of identifying outliers visually is to make a boxplot (or box-and-whiskers plot), which will show the median, a couple of quartiles above and below the median, and the points that lie "far" from this box (see the Wikipedia entry http://en.wikipedia.org/wiki/Box_plot). In R, there's a boxplot function to do just that.
One way to discard/identify outliers programmatically is to use the MAD, or Median Absolute Deviation. The MAD is not sensitive to outliers, unlike the standard deviation. As a rule of thumb, I sometimes consider all points that are more than 5*MAD away from the median to be outliers.
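A rough numpy sketch of that rule on the availability column (the helper name and the 5x multiplier are just the rule of thumb above; tune to taste):

    import numpy as np

    def mad_outliers(values, threshold=5.0):
        # Flag values more than threshold * MAD away from the median
        values = np.asarray(values, dtype=float)
        median = np.median(values)
        mad = np.median(np.abs(values - median))
        if mad == 0:
            # All values essentially identical; nothing to flag
            return np.zeros(len(values), dtype=bool)
        return np.abs(values - median) > threshold * mad

    # %Availability column from the question; bos-b and ord-b get flagged
    availability = [98.099, 98.099, 55.123, 99.110, 98.123, 75.005, 99.020]
    print(mad_outliers(availability))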
Upvotes: 8
Reputation: 31741
Your stated goal of "finding badness" implies that it is not the outliers that you are looking for, but observations that fall above or below some threshold, and I would presume that the threshold would remain the same over time.
As an example, if all of your servers were at 98 ± 0.1 % availability, a server at 100% availability would be an outlier, as would a server at 97.6% availability. But these may be within your desired limits.
On the other hand, there may be good reasons a priori to want to be notified of any server at less than 95% availability, whether there is one server or many below this threshold.
For this reason, a search for outliers may not provide the information that you are interested in. The thresholds could be determined statistically based on historical data, e.g. by modeling error rate as a Poisson variable or percent availability as a beta variable. In an applied setting, these thresholds could probably be determined based on performance requirements.
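A minimal sketch of that approach, with made-up limits (set your own from requirements or historical data):

    # Hypothetical per-column limits; 'low' means alert below, 'high' means alert above
    thresholds = {
        '%Availability':       (95.0, 'low'),
        'Errors/Sec':          (100,  'high'),
        '%Memory_Utilization': (95,   'high'),
    }

    columns = ['%Availability', 'Requests/Sec', 'Errors/Sec', '%Memory_Utilization']

    def check_row(row):
        # Return (column, value) pairs that breach their threshold
        alerts = []
        for col, value in zip(columns, row):
            if col in thresholds:
                limit, direction = thresholds[col]
                if (direction == 'low' and value < limit) or \
                   (direction == 'high' and value > limit):
                    alerts.append((col, value))
        return alerts

    # bos-b from the example: low availability and high error rate both trip
    print(check_row([55.123, 1513, 576, 22]))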
Upvotes: 7
Reputation: 3189
You need to calculate the Mean (Average) and Standard Deviation for each column. Standard deviation is a bit confusing, but the important fact is that, for roughly normally distributed data, about 2/3 of the values fall within
Mean +/- StandardDeviation
Generally anything outside Mean +/- 2 * StandardDeviation is an outlier, but you can tweak the multiplier.
http://en.wikipedia.org/wiki/Standard_deviation
So to be clear, you want to convert the data to standard deviations from the mean.
i.e.
def getdeviations(x, mean, stddev):
    # number of standard deviations that x lies from the mean
    return abs(x - mean) / stddev
Numpy has functions for this.
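A rough numpy sketch over the numeric columns (the 2x cutoff is just the convention mentioned above):

    import numpy as np

    # Numeric columns only: %Availability, Requests/Sec, Errors/Sec, %Memory_Utilization
    data = np.array([
        [98.099, 1012, 678,  91],
        [98.099, 1111,  12,  91],
        [55.123, 1513, 576,  22],
        [99.110,  988,  10,  89],
        [98.123, 1121,  11,  90],
        [75.005, 1301, 123, 100],
        [99.020, 1000,  10,  88],
    ])

    mean = data.mean(axis=0)      # column-wise means
    stddev = data.std(axis=0)     # column-wise standard deviations

    # Distance from the column mean, in units of standard deviation
    zscores = np.abs(data - mean) / stddev

    # Boolean mask: True marks values more than 2 standard deviations out
    print(zscores > 2)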
Upvotes: 1