Reputation: 11
I have household IDs and their respective sales. As it turns out, a few of these HH IDs have extremely high total sales. Can you please suggest a good method for outlier treatment? It would be great if you could suggest one in SAS.
Regards, Saket
Upvotes: 1
Views: 4579
Reputation: 706
The following is a basic, rather crude method. It involves removing values that lie more than 3 standard deviations from the mean:
** Standardise data;
proc standard data=sales_data mean=0 std=1 out=sales_data_std;
var sales;
run;
** Remove values more than 3 std devs from mean;
** (after PROC STANDARD, sales holds z-scores, not the original amounts);
data sales_data_no_outliers;
set sales_data_std;
where -3 <= sales <= 3;
run;
There's a reference to this approach in Wikipedia.
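One side effect of the code above is that PROC STANDARD overwrites sales with its standardised value, so the original amounts are lost in the output. As a rough sketch of a variation, the same 3 standard deviation rule can be applied while keeping the original sales and flagging rows instead of deleting them. The names sales_data_flagged, sales_z and outlier_flag are made up for illustration; sales_data and sales are taken from the code above.

```sas
** Flag (rather than delete) values more than 3 std devs from the mean;
** PROC SQL remerges the overall mean and std back onto every row;
proc sql;
  create table sales_data_flagged as
  select *,
         (sales - mean(sales)) / std(sales) as sales_z,
         (abs((sales - mean(sales)) / std(sales)) > 3) as outlier_flag
  from sales_data;
quit;

** Inspect the flagged households before deciding what to do with them;
proc print data=sales_data_flagged;
  where outlier_flag = 1;
run;
```

SAS writes a note to the log about remerging summary statistics; that is expected here. Flagging first makes it easier to look at the extreme households before deciding whether to drop, cap or keep them.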
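Still, it's crude; it relies on your variable being roughly normally distributed. Even for perfectly normal data, about 0.27% of values lie more than 3 standard deviations from the mean, so once n runs into the thousands it will almost always find "outliers" even if, in all reasonableness, the values are not really outlying.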
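Since the rule leans on normality, it may be worth looking at the distribution of sales before applying it. A minimal sketch using PROC UNIVARIATE, again assuming the dataset is called sales_data and the variable sales as in the code above:

```sas
** Check whether sales looks roughly normal before trusting a 3 std dev cut-off;
proc univariate data=sales_data normal;
  var sales;
  histogram sales / normal;                 ** histogram with fitted normal curve;
  qqplot sales / normal(mu=est sigma=est);  ** Q-Q plot against a fitted normal;
run;
```

The NORMAL option adds tests for normality to the output, and the two plots show any skew or heavy tail. If the histogram is strongly skewed, the 3 standard deviation rule is probably not a good fit for the raw values.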
The subject of outliers is long and detailed but a cursory overview of the topic might be useful. Unfortunately, I can't really think of any introductory sources off-hand.
Upvotes: 2