septagram

Reputation: 465

Finding outliers in a data set

I have a Python script that builds a list of lists of server uptime and performance data, where each sub-list (or 'row') contains one cluster's stats. Nicely formatted, it looks something like this:

-------  -------------  ------------  ----------  -------------------
Cluster  %Availability  Requests/Sec  Errors/Sec  %Memory_Utilization
-------  -------------  ------------  ----------  -------------------
ams-a    98.099          1012         678          91
bos-a    98.099          1111         12           91
bos-b    55.123          1513         576          22
lax-a    99.110          988          10           89
pdx-a    98.123          1121         11           90
ord-b    75.005          1301         123          100
sjc-a    99.020          1000         10           88
...(so on)...

So in list form, it might look like:

[['ams-a', 98.099, 1012, 678, 91], ['bos-a', 98.099, 1111, 12, 91], ...]

My question: What's the best way to determine the outliers in each column? Or are outliers not necessarily the best way to attack the problem of finding 'badness'? In the data above, I'd definitely want to know about bos-b and ord-b, as well as ams-a since its error rate is so high, but the others can be discarded. Since higher is not necessarily worse in every column (nor is lower), I'm trying to figure out the most efficient way to do this. It seems like numpy gets mentioned a lot for this sort of thing, but I'm not sure where to even start with it (sadly, I'm more sysadmin than statistician...).

Thanks in advance!

Upvotes: 17

Views: 18573

Answers (4)

Navi

Reputation: 8736

I think your best bet is to have a look at scipy's scoreatpercentile function (scipy.stats.scoreatpercentile). For instance, you could try excluding all the values that are above the 99th percentile.
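As a rough sketch of the percentile idea, here is one way it could look on the sample rows from the question. It uses numpy.percentile instead of scipy's scoreatpercentile (a swap on my part, to keep the dependencies to numpy only), and the helper name above_percentile is made up for illustration.

```python
import numpy as np

# Sample rows copied from the question:
# [cluster, %Availability, Requests/Sec, Errors/Sec, %Memory_Utilization]
rows = [
    ["ams-a", 98.099, 1012, 678, 91],
    ["bos-a", 98.099, 1111, 12, 91],
    ["bos-b", 55.123, 1513, 576, 22],
    ["lax-a", 99.110, 988, 10, 89],
    ["pdx-a", 98.123, 1121, 11, 90],
    ["ord-b", 75.005, 1301, 123, 100],
    ["sjc-a", 99.020, 1000, 10, 88],
]

def above_percentile(rows, col, per=99):
    """Return the cluster names whose value in column `col` lies above the `per`-th percentile."""
    values = np.array([row[col] for row in rows], dtype=float)
    cutoff = np.percentile(values, per)  # linear interpolation by default
    return [row[0] for row, v in zip(rows, values) if v > cutoff]

# Column 3 is Errors/Sec
print(above_percentile(rows, col=3))
```

Keep in mind that a percentile cutoff flags the top slice of values no matter how extreme they actually are, so it probably makes more sense on a large data set than on seven rows.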

Mean and standard deviation are poor guides if your data isn't roughly normally distributed, since a few extreme values pull both of them around.

Generally it's good to have a rough visual idea of what your data looks like. There is matplotlib; I recommend you make some plots of your data with it before deciding on a plan.

Upvotes: 5

Prasad Chalasani

Reputation: 20282

One good way of identifying outliers visually is to make a boxplot (or box-and-whiskers plot), which shows the median, the quartiles above and below it (the box), and the points that lie "far" from this box (see the Wikipedia entry http://en.wikipedia.org/wiki/Box_plot). In R, there's a boxplot function to do just that.

One way to discard/identify outliers programmatically is to use the MAD, or Median Absolute Deviation: the median of the absolute distances of the points from the median. Unlike the standard deviation, the MAD is not sensitive to outliers. As a rule of thumb, I sometimes treat all points that are more than 5*MAD away from the median as outliers.
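A minimal sketch of that 5*MAD rule in Python with numpy, applied to the Errors/Sec column from the question. The function name mad_outliers and the parameter k are just illustrative choices, not anything standard.

```python
import numpy as np

def mad_outliers(names, values, k=5.0):
    """Return the names whose value is more than k*MAD away from the median."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))  # median absolute deviation
    if mad == 0:
        # At least half the values equal the median, so the rule cannot rank anything.
        return []
    return [n for n, v in zip(names, values) if abs(v - median) > k * mad]

# Cluster names and Errors/Sec values copied from the question's sample table
names = ["ams-a", "bos-a", "bos-b", "lax-a", "pdx-a", "ord-b", "sjc-a"]
errors = [678, 12, 576, 10, 11, 123, 10]

print(mad_outliers(names, errors))  # ['ams-a', 'bos-b', 'ord-b']
```

Here the median is 12 and the MAD is 2, so anything more than 10 away from 12 gets flagged. The multiplier 5 is only the rule of thumb from above and can be tuned per column.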

Upvotes: 8

David LeBauer

Reputation: 31741

Your stated goal of "finding badness" implies that it is not the outliers that you are looking for, but observations that fall above or below some threshold, and I would presume that the threshold would remain the same over time.

As an example, if all of your servers were at 98 ± 0.1 % availability, a server at 100% availability would be an outlier, as would a server at 97.6% availability. But these may be within your desired limits.

On the other hand, there may be good reasons a priori to want to be notified of any server at less than 95% availability, whether one or many servers fall below this threshold.

For this reason, a search for outliers may not provide the information that you are interested in. The thresholds could be determined statistically from historical data, e.g. by modeling the error rate as a Poisson variable or the percent availability as a beta variable. In an applied setting, these thresholds could probably be determined from performance requirements.
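To make the fixed-threshold idea concrete, here is a small sketch in plain Python. The 95% availability limit comes from the example above; the Errors/Sec and memory limits, the LIMITS table, and the function name find_violations are hypothetical values chosen only for illustration and would need to come from your own requirements.

```python
import operator

# Sample rows copied from the question:
# [cluster, %Availability, Requests/Sec, Errors/Sec, %Memory_Utilization]
rows = [
    ["ams-a", 98.099, 1012, 678, 91],
    ["bos-a", 98.099, 1111, 12, 91],
    ["bos-b", 55.123, 1513, 576, 22],
    ["lax-a", 99.110, 988, 10, 89],
    ["pdx-a", 98.123, 1121, 11, 90],
    ["ord-b", 75.005, 1301, 123, 100],
    ["sjc-a", 99.020, 1000, 10, 88],
]

# column name -> (column index, "is bad" comparison, limit)
# Only the 95% availability limit is from the text; the other two are assumed.
LIMITS = {
    "%Availability": (1, operator.lt, 95.0),       # bad when below the limit
    "Errors/Sec": (3, operator.gt, 100),           # bad when above the limit
    "%Memory_Utilization": (4, operator.gt, 95),   # bad when above the limit
}

def find_violations(rows, limits=LIMITS):
    """Map each cluster name to the list of columns that breach their limit."""
    violations = {}
    for row in rows:
        for column, (idx, is_bad, limit) in limits.items():
            if is_bad(row[idx], limit):
                violations.setdefault(row[0], []).append(column)
    return violations

for cluster, columns in find_violations(rows).items():
    print(cluster, columns)
```

Storing the comparison alongside each limit handles the "higher is not always worse" concern from the question: each column declares its own bad direction.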

Upvotes: 7

wmil

Reputation: 3189

You need to calculate the Mean (Average) and Standard Deviation for the column. Standard deviation is a bit confusing, but the important fact is that, for normally distributed data, about 2/3 (roughly 68%) of the values fall within

Mean +/- StandardDeviation

Generally anything outside Mean +/- 2 * StandardDeviation is an outlier, but you can tweak the multiplier.

http://en.wikipedia.org/wiki/Standard_deviation

So to be clear, you want to convert the data to standard deviations from the mean.

i.e.

def getdeviations(x, mean, stddev):
    # Distance of x from the mean, in units of standard deviation (built-in abs)
    return abs(x - mean) / stddev

Numpy has functions for this.
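For what it's worth, a sketch of the same calculation done for a whole column with numpy might look like the following. The function name zscore_outliers and the multiplier parameter are illustrative, and it assumes the population standard deviation (numpy's default, ddof=0).

```python
import numpy as np

def zscore_outliers(names, values, multiplier=2.0):
    """Return the names whose value is more than `multiplier` standard deviations from the mean."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    stddev = values.std()  # population standard deviation (ddof=0)
    if stddev == 0:
        return []  # all values are identical, so nothing stands out
    deviations = np.abs(values - mean) / stddev
    return [n for n, d in zip(names, deviations) if d > multiplier]

# Cluster names and %Availability values copied from the question's sample table
names = ["ams-a", "bos-a", "bos-b", "lax-a", "pdx-a", "ord-b", "sjc-a"]
availability = [98.099, 98.099, 55.123, 99.110, 98.123, 75.005, 99.020]

print(zscore_outliers(names, availability))  # ['bos-b']
```

On these seven rows it flags bos-b but not ord-b at 75% availability, because the two low values themselves inflate the standard deviation. That is worth keeping in mind when picking the multiplier.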

Upvotes: 1
