Eghbal

Reputation: 3803

Statistical outlier detection in MATLAB

Suppose that we have this matrix:

main = [10000   5   3   1;
5   5677    0   134;
1   1   456 3];

The following criterion is the most widely used method in econometrics and statistical problems, where X is the data in which we are searching for outliers:

X-mean(X)>= n*std(X)

If this inequality holds, the sample is an outlier; otherwise we keep the sample.

Now my question: I want to find outliers with this code:

meann = mean(main);
stdd = std(main);
out = find(main-repmat(meann,size(main,1),1)>=repmat(2*stdd,size(main,1),1));

We search for outliers in every column. out should contain the indices of the outliers. In the final step, we should remove the outliers from every column. Is there a simpler function or method to do this in MATLAB?

Thanks.

Upvotes: 2

Views: 11409

Answers (3)

Yvon

Reputation: 2993

Use a cell array if you want to remove certain elements from different columns.

main = rand(100,4);
main(10,1) = 10000;
main(40,2) = 4321;
main([10,20,30],3)=[938;10;4];


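% Per-column mean and standard deviation, one cell per column
mu = num2cell(mean(main));
sig = num2cell(std(main));

% Split the matrix into one cell per column
m = num2cell(main,1);
% Row indices of the values that are kept (within 2 standard deviations of the mean)
ind = cellfun(@(x,m,s) find( bsxfun(@lt, abs( bsxfun(@minus,x,m) ), 2*s) ),...
    m, mu, sig, 'uni', 0);
% The kept values themselves; the columns may now differ in length
data = cellfun(@(x,m,s) x( bsxfun(@lt, abs( bsxfun(@minus,x,m) ), 2*s) ),...
    m, mu, sig, 'uni', 0);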

P.S. Your example is too small, so there may not be enough samples to establish a threshold.
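To make the sample-size point concrete, here is a small check on the matrix from the question. It relies on a known bound (Samuelson's inequality): with the default n-1 normalization of std, no value in a column of n samples can lie more than (n-1)/sqrt(n) standard deviations from the column mean. The variable names z, maxZ and bound are only for illustration.

```matlab
% Matrix from the question: 3 samples per column
main = [10000 5 3 1; 5 5677 0 134; 1 1 456 3];
n = size(main, 1);

% Absolute z-score of every element, computed per column
z = abs(bsxfun(@rdivide, bsxfun(@minus, main, mean(main)), std(main)));

maxZ  = max(z(:));          % largest z-score actually observed
bound = (n - 1) / sqrt(n);  % theoretical maximum, about 1.1547 for n = 3

% A 2-sigma threshold can therefore never fire with only 3 rows
anyFlagged = any(z(:) >= 2);
```

With three rows the bound is below 2, so a 2*std rule cannot flag anything regardless of the values; the bound only exceeds 2 once a column has at least 6 samples.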

Upvotes: 2

Dan

Reputation: 45752

If you want to find values more than 2 standard deviations away from the mean on a per-column basis, I would use bsxfun rather than repmat, like this:

meann = mean(main)
stdd = std(main)

I = bsxfun(@gt, abs(bsxfun(@minus, main, meann)), 2*stdd)

I would stop at I, as this logical mask already lets you remove the outliers. However, you can call find if you like:

out = find(I)

Although to me it is more intuitive to do this:

I = bsxfun(@lt, meann + 2*stdd, main) | bsxfun(@gt, meann - 2*stdd, main)

I think your repmat solution is missing an abs, by the way.
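As a rough sketch of how the logical mask can be used for the removal step, the snippet below builds a small made-up test matrix (not the one from the question) with two planted outliers, then shows two options: overwriting outliers with NaN so the matrix keeps its shape, or collecting the kept values per column in a cell array. The names cleaned and kept are just illustrative.

```matlab
% Made-up test data: three identical columns 1..20, with two planted outliers
main = repmat((1:20)', 1, 3);
main(5, 1)  = 1000;
main(12, 2) = -500;

% Logical mask of values more than 2 standard deviations from the column mean
I = bsxfun(@gt, abs(bsxfun(@minus, main, mean(main))), 2*std(main));

% Option 1: keep the matrix shape and mark outliers as NaN
cleaned = main;
cleaned(I) = NaN;

% Option 2: drop outliers per column; columns may end up with different lengths
kept = arrayfun(@(c) main(~I(:, c), c), 1:size(main, 2), 'UniformOutput', false);

% Row and column subscripts of the outliers, if indices are needed
[row, col] = find(I);
```

The NaN version is convenient when later code expects a rectangular matrix; the cell array version actually removes the elements, at the cost of losing the matrix layout.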

Upvotes: 3

Jommy

Reputation: 1105

A 2*sigma criterion is certainly simple, but the mean and the standard deviation are themselves very sensitive to outliers. The out variable is affected accordingly, and in fact your code doesn't find any outliers in the given matrix.

To detect the outliers you can simply compare the values appearing in your matrix against the median, or adopt more refined criteria. There is a beautiful lecture explaining this at https://www.mne.psu.edu/me345/Lectures/outliers.pdf
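One possible median-based rule, sketched here as an assumption and not necessarily the criterion from the linked lecture, uses the median absolute deviation (MAD) per column. The factor 1.4826 makes the MAD comparable to a standard deviation for normally distributed data, and the threshold of 3 is an arbitrary choice for illustration.

```matlab
% Matrix from the question
main = [10000 5 3 1; 5 5677 0 134; 1 1 456 3];

% Per-column median and absolute deviations from it
med    = median(main);
absDev = abs(bsxfun(@minus, main, med));

% Scaled median absolute deviation per column (robust spread estimate)
scaledMAD = 1.4826 * median(absDev);

% Flag values more than 3 scaled MADs away from the column median
% (the threshold 3 is an assumption, tune it for your data)
I = bsxfun(@gt, absDev, 3 * scaledMAD);
```

On this matrix the mask flags 10000, 5677, 456 and 134, the values that the 2*std rule misses. Note that a column whose MAD is zero would make every value differing from the median count as an outlier, so that case needs separate handling.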

Upvotes: 4
