Abrar Al-Shaer

Reputation: 61

Boxplot.stats R not identifying outliers

I have used boxplot.stats()$out to get the outliers of a vector in R. However, I noticed that it often fails to identify outliers. For example:

list = c(3,4,7,500)
boxplot.stats(list)

$`stats`
[1]   3.0   3.5   5.5 253.5 500.0

$n
[1] 4

$conf
[1] -192  203

$out
numeric(0)

quantile(list)

    0%    25%    50%    75%   100% 
  3.00   3.75   5.50 130.25 500.00 

130.25+1.5*IQR(list) = 320

As you can see, the boxplot.stats() function failed to flag 500 as an outlier, even though the documentation says it uses the Q1/Q3 +/- 1.5*IQR method. Since 500 is greater than 320, it should have been identified as an outlier, but it was not, and I'm not sure why.

I have tried this with a list of 5 elements instead of 4, and with an outlier that is very small instead of very large, and I still get the same problem.

Upvotes: 2

Views: 1466

Answers (3)

R. Ladwein

Reputation: 31

Try this:

library(car)
# Petal.Length and Species are columns of the built-in iris data set
Boxplot(Petal.Length ~ Species, data = iris, id = list(n = Inf))

to label all the outliers.

Upvotes: 0

G5W

Reputation: 37641

Notice that the fourth number in the "stats" portion is 253.5, not 130.25. The documentation for boxplot.stats says:

The two ‘hinges’ are versions of the first and third quartile, i.e., close to quantile(x, c(1,3)/4). The hinges equal the quartiles for odd n (where n <- length(x)) and differ for even n. Whereas the quartiles only equal observations for n %% 4 == 1 (n = 1 mod 4), the hinges do so additionally for n %% 4 == 2 (n = 2 mod 4), and are in the middle of two observations otherwise

In other words, for your data, it is using (500+7)/2 = 253.5 as the Q3 value (and, incidentally, (3+4)/2 = 3.5 as Q1, not the 3.75 that you got from quantile). Boxplot will use the boundary 253.5 + 1.5*(253.5 - 3.5) = 628.5, and 500 is below that.
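A minimal sketch of the comparison described above, assuming the default coef = 1.5. It computes the upper boundary once from the hinges (which boxplot.stats takes from fivenum) and once from the default quantile() quartiles; the variable names are only illustrative.

```r
x <- c(3, 4, 7, 500)

# Hinges from Tukey's five-number summary (what boxplot.stats uses)
hinges <- fivenum(x)[c(2, 4)]

# Quartiles from quantile() with its default type
quartiles <- quantile(x, c(0.25, 0.75), names = FALSE)

# Upper boundary under each definition, with the default coef = 1.5
upper_hinge_fence <- hinges[2] + 1.5 * diff(hinges)
upper_quantile_fence <- quartiles[2] + 1.5 * diff(quartiles)

# Points above each boundary
out_by_hinges <- x[x > upper_hinge_fence]        # empty, like boxplot.stats(x)$out
out_by_quantiles <- x[x > upper_quantile_fence]  # contains 500

print(c(hinge = upper_hinge_fence, quantile = upper_quantile_fence))
```

If you want the quantile-based rule from the question, the last filter is one way to apply it by hand rather than relying on boxplot.stats.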

Upvotes: 2

Rui Barradas

Reputation: 76412

If you read the help page help("boxplot.stats") carefully, the return value section says the following. My emphasis.

stats
a vector of length 5, containing the extreme of the lower
whisker, the lower ‘hinge’, the median, the upper ‘hinge’ and
the extreme of the upper whisker.

Then, in the same section, again my emphasis.

out
the values of any data points which lie beyond the extremes of the whiskers (if(do.out)).

Your data has 4 points. The extreme of the upper whisker, as returned in list member $stats, is 500.0, and this is the maximum of your data. No data point lies beyond it, so $out is empty. There is no error.
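As a rough illustration of how $stats and $out relate, the sketch below compares the vector from the question with a longer one. The longer vector is made-up data chosen only so that 500 ends up beyond the whisker; it does not come from the question.

```r
small <- c(3, 4, 7, 500)               # data from the question
large <- c(3, 4, 5, 6, 7, 8, 9, 500)   # hypothetical data for contrast

small_stats <- boxplot.stats(small)
large_stats <- boxplot.stats(large)

# Fifth element of $stats is the extreme of the upper whisker
small_whisker <- small_stats$stats[5]  # the whisker reaches the maximum
large_whisker <- large_stats$stats[5]  # the whisker stops short of 500

# $out holds only the points beyond the whisker extremes
small_out <- small_stats$out           # nothing lies beyond the whisker
large_out <- large_stats$out           # 500 lies beyond the whisker

print(list(small = small_out, large = large_out))
```

In the second vector the hinges are much closer together, so the whisker stops at 9 and 500 is reported in $out.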

Upvotes: 1
