user1912925

Reputation: 781

Remove certain outliers from matlab boxplot

In MATLAB, the boxplot command generates boxplots. By default this function uses a maximum whisker length of 1.5 * IQR (75th percentile - 25th percentile), and this length can be changed to another multiple of the IQR if needed. However, it is not possible to use specific percentiles as the limits of the whiskers, which is what I require (in my case the 10th and 90th percentiles). As the following example shows, I have made some progress but have run into a problem.

Consider the following data:

    Box_Data_PFCA = [-3;1;2;3;4;5;5;5;6;40;45;77;7;9;1;2;3;7;7;7;10;11;11;40;30;101;110;150];
    label = ['PFOS';'PFOS';'PFOS';'PFOS';'PFOS';'PFOS';'PFOS';'PFOS';'PFOS';'PFOS';'PFOS';'PFOS';'PFOS';'PFOS';...
             'PFDA';'PFDA';'PFDA';'PFDA';'PFDA';'PFDA';'PFDA';'PFDA';'PFDA';'PFDA';'PFDA';'PFDA';'PFDA';'PFDA'];

From this I generate a boxplot using the default MATLAB function:

    h = boxplot(Box_Data_PFCA,label)

I then calculate the percentiles I require to generate a boxplot:

    PFOS_10=prctile([-3;1;2;3;4;5;5;5;6;40;45;77;7;9],10)
    PFOS_25=prctile([-3;1;2;3;4;5;5;5;6;40;45;77;7;9],25)
    PFOS_75=prctile([-3;1;2;3;4;5;5;5;6;40;45;77;7;9],75)
    PFOS_90=prctile([-3;1;2;3;4;5;5;5;6;40;45;77;7;9],90)
    PFDA_10=prctile([1;2;3;7;7;7;10;11;11;40;30;101;110;150],10)
    PFDA_25=prctile([1;2;3;7;7;7;10;11;11;40;30;101;110;150],25)
    PFDA_75=prctile([1;2;3;7;7;7;10;11;11;40;30;101;110;150],75)
    PFDA_90=prctile([1;2;3;7;7;7;10;11;11;40;30;101;110;150],90)
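The same percentiles can be computed without retyping each group's values, which matters once the dataset is large. The following is a minimal sketch, assuming the grouping variable is a char matrix like label above; the names groups, names, pcts and P are illustrative and not part of the original code. prctile requires the Statistics Toolbox, as in the question.

```matlab
% Data from the question.
Box_Data_PFCA = [-3;1;2;3;4;5;5;5;6;40;45;77;7;9;1;2;3;7;7;7;10;11;11;40;30;101;110;150];
label = [repmat('PFOS', 14, 1); repmat('PFDA', 14, 1)];

groups = cellstr(label);              % char matrix -> cell array of strings
names  = unique(groups, 'stable');    % group names in order of first appearance
pcts   = [10 25 75 90];               % percentiles of interest

% One row per group, one column per percentile.
P = zeros(numel(names), numel(pcts));
for k = 1:numel(names)
    inGroup = strcmp(groups, names{k});                 % logical mask for this group
    p       = prctile(Box_Data_PFCA(inGroup), pcts);    % Statistics Toolbox function
    P(k, :) = reshape(p, 1, []);                        % store as a row
end
```

Here P(1,:) holds the PFOS percentiles and P(2,:) the PFDA percentiles. This assumes boxplot draws the groups in order of first appearance, so that row k of P lines up with column k of h; that is worth checking against the tick labels of the plot.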

I then edit the boxplot using the graphics handles (editing the box is unnecessary in my case, since the default settings of 25% to 75% suit me, but I show it for completeness):

    % Column 1 (PFOS): box, upper/lower whisker, upper/lower whisker cap
    set(h(5,1), 'YData', [PFOS_25 PFOS_75 PFOS_75 PFOS_25 PFOS_25])
    set(h(1,1), 'YData', [PFOS_75 PFOS_90])
    set(h(2,1), 'YData', [PFOS_10 PFOS_25])
    set(h(3,1), 'YData', [PFOS_90 PFOS_90])
    set(h(4,1), 'YData', [PFOS_10 PFOS_10])
    % Column 2 (PFDA): same handles for the second group
    set(h(5,2), 'YData', [PFDA_25 PFDA_75 PFDA_75 PFDA_25 PFDA_25])
    set(h(1,2), 'YData', [PFDA_75 PFDA_90])
    set(h(2,2), 'YData', [PFDA_10 PFDA_25])
    set(h(3,2), 'YData', [PFDA_90 PFDA_90])
    set(h(4,2), 'YData', [PFDA_10 PFDA_10])
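For more than a couple of groups, the set calls above could be folded into a loop over the columns of h. This is only a sketch: setPercentileWhiskers is a hypothetical helper name, and P is assumed to be a matrix with one row per group holding the 10th, 25th, 75th and 90th percentiles in that order. The box itself is left untouched, since the default 25% to 75% box is already what is wanted.

```matlab
function setPercentileWhiskers(h, P)
% SETPERCENTILEWHISKERS  Hypothetical helper, not part of MATLAB.
%   h : handle matrix returned by boxplot (one column per group)
%   P : numGroups-by-4 matrix of [10th 25th 75th 90th] percentiles
for k = 1:size(h, 2)
    p10 = P(k,1); p25 = P(k,2); p75 = P(k,3); p90 = P(k,4);
    set(h(1,k), 'YData', [p75 p90]);   % upper whisker: box top to 90th percentile
    set(h(2,k), 'YData', [p10 p25]);   % lower whisker: 10th percentile to box bottom
    set(h(3,k), 'YData', [p90 p90]);   % upper whisker cap
    set(h(4,k), 'YData', [p10 p10]);   % lower whisker cap
end
end
```

It would be called as setPercentileWhiskers(h, P) right after h = boxplot(...).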

which results in the following:

[Image: boxplot with the whiskers moved to the 10th and 90th percentiles]

As you can see some of my outliers overlap with the whiskers following the changes I have made to the whiskers.

My question is: how can I ensure that outliers inside my whiskers are removed, and those outside my whiskers are shown, after these changes? I realise I need to use the 'Outliers' handle somehow, but the solution is not presenting itself to me. Since this is only an example dataset, the solution would also have to work on large datasets.

Upvotes: 0

Views: 2963

Answers (2)

Gelliant

Reputation: 1845

So if a point is smaller than your highest whisker and bigger than your lowest, you want to remove it.

Can't you just check the positions using your h variable? Something like this:

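    y = get(h(7,1), 'YData');  x = get(h(7,1), 'XData');   % outlier markers of group 1 (row 7)
    inside = y < PFOS_90 & y > PFOS_10;                     % points between the whiskers
    set(h(7,1), 'XData', x(~inside), 'YData', y(~inside));  % keep only the points outside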

[edit]

Glad the idea above pointed you to a working solution! Your code is a bit long, but very readable. That's important as well. But maybe these four lines do the same job?

    idx = find(h(7,1).YData<PFOS_90&h(7,1).YData>PFOS_10);
    h(7,1).YData(idx)=[];h(7,1).XData(idx)=[];
    idx = find(h(7,2).YData<PFDA_90&h(7,2).YData>PFDA_10);
    h(7,2).YData(idx)=[];h(7,2).XData(idx)=[];

Could it be that with many points to remove you need to check more than just (7,1) and (7,2)? In that case, loop over the columns with for i = 1:size(h,2).
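The loop over size(h,2) might look roughly like the sketch below. removeInnerOutliers is a hypothetical helper name, and P is assumed to be a matrix with one row per group whose first and fourth columns hold the 10th and 90th percentiles. It uses get and set rather than dot notation so that it does not depend on whether boxplot returns numeric handles or graphics objects.

```matlab
function removeInnerOutliers(h, P)
% REMOVEINNEROUTLIERS  Hypothetical helper, not part of MATLAB.
%   h : handle matrix returned by boxplot (row 7 holds the outlier markers)
%   P : numGroups-by-4 matrix of [10th 25th 75th 90th] percentiles
for k = 1:size(h, 2)
    x = get(h(7,k), 'XData');
    y = get(h(7,k), 'YData');
    % Keep markers strictly outside the whiskers; NaN entries are kept so a
    % group without outliers retains its empty placeholder.
    keep = y < P(k,1) | y > P(k,4) | isnan(y);
    set(h(7,k), 'XData', x(keep), 'YData', y(keep));   % update both in one call
end
end
```

Setting XData and YData in a single set call keeps the two vectors the same length at every step.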

[/edit]

Upvotes: 1

user1912925

Reputation: 781

Following @Gelliant's tip I have managed to figure out a solution. It is not pretty and can no doubt be refined, but it works. I add the following lines of code to those posted in my question:

    % Group 1 (PFOS): drop outlier markers that lie between the whiskers
    a = get(h(7,1), 'YData');
    b = get(h(7,1), 'XData');
    idx = find(a<PFOS_90&a>PFOS_10);
    a(idx)=[];
    b(idx)=[];
    set(h(7,1), 'YData', a, 'XData', b)
    % Group 2 (PFDA): same steps with the PFDA percentiles
    e = get(h(7,2), 'YData');
    f = get(h(7,2), 'XData');
    idx = find(e<PFDA_90&e>PFDA_10);
    e(idx)=[];
    f(idx)=[];
    set(h(7,2), 'YData', e, 'XData', f)

This results in the plot below which can be compared to my original in the question. Any tips on how to refine my solution are welcome!

[Image: boxplot after removing the outlier markers that fell inside the whiskers]
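One possible refinement: filtering the existing markers only removes points, so a value that lies outside the 10th to 90th percentile range but inside the default 1.5 * IQR whiskers would still not be drawn. In the PFOS sample, -3 appears to be such a point. A sketch that rebuilds the outlier markers from the raw data instead is given below. showPercentileOutliers is a hypothetical helper name, P is assumed to hold one row of [10th 25th 75th 90th] percentiles per group, and groupIdx is assumed to number the groups in the same order as the columns of h.

```matlab
function showPercentileOutliers(h, y, groupIdx, P)
% SHOWPERCENTILEOUTLIERS  Hypothetical helper, not part of MATLAB.
%   h        : handle matrix returned by boxplot (row 6 median, row 7 outliers)
%   y        : data vector that was passed to boxplot
%   groupIdx : group number of each observation, matching the columns of h
%   P        : numGroups-by-4 matrix of [10th 25th 75th 90th] percentiles
for k = 1:size(h, 2)
    yk  = y(groupIdx == k);
    out = yk(yk < P(k,1) | yk > P(k,4));     % strictly outside the new whiskers
    if isempty(out)
        out = NaN;                           % placeholder so that nothing is drawn
    end
    xk = mean(get(h(6,k), 'XData'));         % horizontal centre of box k (median line)
    set(h(7,k), 'XData', repmat(xk, size(out)), 'YData', out);
end
end
```

For the data in the question, groupIdx could be obtained with [~, ~, groupIdx] = unique(cellstr(label), 'stable'), again assuming boxplot orders the groups by first appearance.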

Upvotes: 0
