Reputation: 607
I want to compare (sorty by) growth rates and disadvantage high rates with very low starting values.
Example:
1. Start: 1.000.000 End: 1.100.000 Growth: +10%
3. Start: 1 End: 10 Growth: +900%
Sorting just by growth, descending would result in: 900% (3.), 50% (4.), 20% (2.), 10% (1.)
But I want to have: 20% (2.), 10% (1.), 900% (3.), 50% (4.), because in my case the chance is high, that 3. and 4. are statistical outliers.
What's the best way to solve this problem and do I've to define a threshold for the start values?
Thanks!
Upvotes: 3
Views: 8369
Reputation: 7918
Based on the description you have provided, the problem can be split into 2:
Statistical Outliers
from the data setThe general solution to the first problem and example using Microsoft Excel is described at : Statistical Outliers detection in Microsoft Excel worksheet (http://www.codeproject.com/Tips/214330/Statistical-Outliers-detection). Following is a bit of theory and a sample pertinent to your case.
Finding "Outliers" in a data set could be done by calculating the deviation for each number, expressed as either a "Z-score" or "modified Z-score" and testing it against certain predefined threshold. Z-score typically refers to number of standard deviation relative to the statistical average (in other words, it's measured in "Sigmas"). Modified Z-score applies the median computation technique to measure the deviation and in many cases provides more robust statistical detection of outliers. Mathematically the Modified Z-score could be written (as suggested by Iglewicz and Hoaglin - see the referenced article) as:
Mi = 0.6745 * (Xi - Median(Xi)) / MAD,
where MAD stands for Median Absolute Deviation. Any number in a data set with the absolute value of modified Z-score exceeding 3.5 is considered an "Outlier". Modified Z-score could be used to detect outliers in Microsoft Excel worksheet pertinent to your case as described below.
Step 1. Open a Microsoft Excel worksheet and in Cells A1, A2, A3 and A4 enter the values: 900%, 50% 20% and 10%, correspondingly.
Step 2. In C1 enter the formula: =MEDIAN(A1:A4)
. The value in this cell corresponds to the median calculated on a data set entered at step 1.
Step 3. In C2 enter the array formula: {=MEDIAN(ABS(MEDIAN(A1:A4)-A1:A4))}
. As a reminder, in order to enter the array formula, select the cell, type the formula in Excel Formula Bar and then click on the combination: CTRL-SHIFT-ENTER (notice the curly brackets surrounding the expression, which indicates the array formula). The value in this cell (C2) corresponds to MAD.
Step 4. Enter the formula: =IF((0.6745*ABS(C$1-A1)>3.5*C$2), "OUTLIER", "NORMAL")
in the first row of column B and extend it down to the 4th row. Final result of “Outlier’s detection” should appear in column B.
A B C
900% OUTLIER 35%
50% NORMAL 0.35
20% NORMAL
10% NORMAL
thus the value 900% is found an "Outlier" while other values are OK. Sorting the result set will be just a trivial task.
Excel Worksheet example is included for the clarity of explanation. The algorithm itself could be implemented in any programming languages (VBA, C#, Java, etc). Hope this will help.
Upvotes: 3
Reputation: 1927
my solition
private static List<double> StatisticalOutLierAnalysis(List<double> allNumbers)
{
List<double> normalNumbers = new List<double>();
List<double> outLierNumbers = new List<double>();
double avg = allNumbers.Average();
double standardDeviation = Math.Sqrt(allNumbers.Average(v => Math.Pow(v - avg, 2)));
foreach (double number in allNumbers)
{
if ((Math.Abs(number - avg)) > (2 * standardDeviation))
outLierNumbers.Add(number);
else
normalNumbers.Add(number);
}
return normalNumbers;
}
Upvotes: 3