Finding the 'Outliers' in numeric data set

Question

I want to compare (sorty by) growth rates and disadvantage high rates with very low starting values.

Example:

1. Start: 1.000.000 End: 1.100.000 Growth: +10%

Start: 100.000 End: 120.000 Growth: +20%

3. Start: 1 End: 10 Growth: +900%

Start: 10 End: 15 Growth: +50%

Sorting just by growth, descending would result in: 900% (3.), 50% (4.), 20% (2.), 10% (1.)

But I want to have: 20% (2.), 10% (1.), 900% (3.), 50% (4.), because in my case the chance is high, that 3. and 4. are statistical outliers.

What's the best way to solve this problem and do I've to define a threshold for the start values?

Thanks!

Alexander Bell · Accepted Answer

Based on the description you have provided, the problem can be split into 2:

Finding and excluding Statistical Outliers from the data set
Sorting the resulting values in descending (or just in any) order

The general solution to the first problem and example using Microsoft Excel is described at : Statistical Outliers detection in Microsoft Excel worksheet (http://www.codeproject.com/Tips/214330/Statistical-Outliers-detection). Following is a bit of theory and a sample pertinent to your case.

Finding "Outliers" in a data set could be done by calculating the deviation for each number, expressed as either a "Z-score" or "modified Z-score" and testing it against certain predefined threshold. Z-score typically refers to number of standard deviation relative to the statistical average (in other words, it's measured in "Sigmas"). Modified Z-score applies the median computation technique to measure the deviation and in many cases provides more robust statistical detection of outliers. Mathematically the Modified Z-score could be written (as suggested by Iglewicz and Hoaglin - see the referenced article) as:

Mi = 0.6745 * (Xi - Median(Xi)) / MAD,

where MAD stands for Median Absolute Deviation. Any number in a data set with the absolute value of modified Z-score exceeding 3.5 is considered an "Outlier". Modified Z-score could be used to detect outliers in Microsoft Excel worksheet pertinent to your case as described below.

Step 1. Open a Microsoft Excel worksheet and in Cells A1, A2, A3 and A4 enter the values: 900%, 50% 20% and 10%, correspondingly.

Step 2. In C1 enter the formula: =MEDIAN(A1:A4) . The value in this cell corresponds to the median calculated on a data set entered at step 1.

Step 3. In C2 enter the array formula: {=MEDIAN(ABS(MEDIAN(A1:A4)-A1:A4))} . As a reminder, in order to enter the array formula, select the cell, type the formula in Excel Formula Bar and then click on the combination: CTRL-SHIFT-ENTER (notice the curly brackets surrounding the expression, which indicates the array formula). The value in this cell (C2) corresponds to MAD.

Step 4. Enter the formula: =IF((0.6745*ABS(C$1-A1)>3.5*C$2), "OUTLIER", "NORMAL") in the first row of column B and extend it down to the 4th row. Final result of “Outlier’s detection” should appear in column B.

A       B           C
900%    OUTLIER     35%
50%     NORMAL      0.35
20%     NORMAL  
10%     NORMAL

thus the value 900% is found an "Outlier" while other values are OK. Sorting the result set will be just a trivial task.

Excel Worksheet example is included for the clarity of explanation. The algorithm itself could be implemented in any programming languages (VBA, C#, Java, etc). Hope this will help.

Finding the 'Outliers' in numeric data set

Answers (2)

Related Questions

Finding the &#39;Outliers&#39; in numeric data set

Answers (2)

Related Questions

Finding the 'Outliers' in numeric data set