Jai

Reputation: 61

Box-plot R calculating outliers

> summary(mydata)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0      93     107     110     125     197 
> range=1.5*(125-93)
> upper_whisker=125+range
> lower_whisker=93-range
> upper_whisker
[1] 173
> lower_whisker
[1] 45
> boxplot(mydata)$stats
     [,1]
[1,]   56  # lower whisker reported by boxplot
[2,]   93
[3,]  107
[4,]  125
[5,]  173

I tried looking up the formula for the thresholds above and below which points are considered outliers. It was:

 Above =>3rd Qu +(3rd Qu - 1st Qu)*1.5
 Below =>1st Qu -(3rd Qu - 1st Qu)*1.5

For some reason these don't match the stats returned by the boxplot function in R. I have a feeling I'm missing something silly here.

Are they calculated differently? Or am I reading the wrong answer from boxplot?

Edit:

I've used https://www.kaggle.com/uciml/pima-indians-diabetes-database and ran

mydata=raw$Glucose[raw$Outcome==0]

Edit 2: The box plot

I suppose if

#max(min(x), Q1 - (IQR(x)*1.5)) #lower whisker

is returning min(x), there shouldn't be any outliers, and min(mydata) is 0.

Edit 3: Clearer view of Quantile

quantile(mydata)
  0%  25%  50%  75% 100% 
   0   93  107  125  197 

Edit 4: Added vector as requested

c(85L, 89L, 116L, 115L, 110L, 139L, 103L, 126L, 99L, 97L, 145L, 
117L, 109L, 88L, 92L, 122L, 103L, 138L, 180L, 133L, 106L, 159L, 
146L, 71L, 105L, 103L, 101L, 88L, 150L, 73L, 100L, 146L, 105L, 
84L, 44L, 141L, 99L, 109L, 95L, 146L, 139L, 129L, 79L, 0L, 62L, 
95L, 112L, 113L, 74L, 83L, 101L, 110L, 106L, 100L, 107L, 80L, 
123L, 81L, 142L, 144L, 92L, 71L, 93L, 151L, 125L, 81L, 85L, 126L, 
96L, 144L, 83L, 89L, 76L, 78L, 97L, 99L, 111L, 107L, 132L, 120L, 
118L, 84L, 96L, 125L, 100L, 93L, 129L, 105L, 128L, 106L, 108L, 
154L, 102L, 57L, 106L, 147L, 90L, 136L, 114L, 153L, 99L, 109L, 
88L, 151L, 102L, 114L, 100L, 148L, 120L, 110L, 111L, 87L, 79L, 
75L, 85L, 143L, 87L, 119L, 0L, 73L, 141L, 111L, 123L, 85L, 105L, 
113L, 138L, 108L, 99L, 103L, 111L, 96L, 81L, 147L, 179L, 125L, 
119L, 142L, 100L, 87L, 101L, 197L, 117L, 79L, 122L, 74L, 104L, 
91L, 91L, 146L, 122L, 165L, 124L, 111L, 106L, 129L, 90L, 86L, 
111L, 114L, 193L, 191L, 95L, 142L, 96L, 128L, 102L, 108L, 122L, 
71L, 106L, 100L, 104L, 114L, 108L, 129L, 133L, 136L, 155L, 96L, 
108L, 78L, 161L, 151L, 126L, 112L, 77L, 150L, 120L, 137L, 80L, 
106L, 113L, 112L, 99L, 115L, 129L, 112L, 157L, 179L, 105L, 118L, 
87L, 106L, 95L, 165L, 117L, 130L, 95L, 0L, 122L, 95L, 126L, 139L, 
116L, 99L, 92L, 137L, 61L, 90L, 90L, 88L, 158L, 103L, 147L, 99L, 
101L, 81L, 118L, 84L, 105L, 122L, 98L, 87L, 93L, 107L, 105L, 
109L, 90L, 125L, 119L, 100L, 100L, 131L, 116L, 127L, 96L, 82L, 
137L, 72L, 123L, 101L, 102L, 112L, 143L, 143L, 97L, 83L, 119L, 
94L, 102L, 115L, 94L, 135L, 99L, 89L, 80L, 139L, 90L, 140L, 147L, 
97L, 107L, 83L, 117L, 100L, 95L, 120L, 82L, 91L, 119L, 100L, 
135L, 86L, 134L, 120L, 71L, 74L, 88L, 115L, 124L, 74L, 97L, 154L, 
144L, 137L, 119L, 136L, 114L, 137L, 114L, 126L, 132L, 123L, 85L, 
84L, 139L, 173L, 99L, 194L, 83L, 89L, 99L, 80L, 166L, 110L, 81L, 
154L, 117L, 84L, 94L, 96L, 75L, 130L, 84L, 120L, 139L, 91L, 91L, 
99L, 125L, 76L, 129L, 68L, 124L, 114L, 125L, 87L, 97L, 116L, 
117L, 111L, 122L, 107L, 86L, 91L, 77L, 105L, 57L, 127L, 84L, 
88L, 131L, 164L, 189L, 116L, 84L, 114L, 88L, 84L, 124L, 97L, 
110L, 103L, 85L, 87L, 99L, 91L, 95L, 99L, 92L, 154L, 78L, 130L, 
111L, 98L, 143L, 119L, 108L, 133L, 109L, 121L, 100L, 93L, 103L, 
73L, 112L, 82L, 123L, 67L, 89L, 109L, 108L, 96L, 124L, 124L, 
92L, 152L, 111L, 106L, 105L, 106L, 117L, 68L, 112L, 92L, 183L, 
94L, 108L, 90L, 125L, 132L, 128L, 94L, 102L, 111L, 128L, 92L, 
104L, 94L, 100L, 102L, 128L, 90L, 103L, 157L, 107L, 91L, 117L, 
123L, 120L, 106L, 101L, 120L, 127L, 162L, 112L, 98L, 154L, 165L, 
99L, 68L, 123L, 91L, 93L, 101L, 56L, 95L, 136L, 129L, 130L, 107L, 
140L, 107L, 121L, 90L, 99L, 127L, 118L, 122L, 129L, 110L, 80L, 
127L, 158L, 126L, 134L, 102L, 94L, 108L, 83L, 114L, 117L, 111L, 
112L, 116L, 141L, 175L, 92L, 106L, 105L, 95L, 126L, 65L, 99L, 
102L, 109L, 153L, 100L, 81L, 121L, 108L, 137L, 106L, 88L, 89L, 
101L, 122L, 121L, 93L)

Upvotes: 4

Views: 5609

Answers (3)

Tomomi Landsman

Reputation: 1

I realize this is very old, but just in case:

The calculated lower limit where anything less is an outlier is 45, which is different from the extent of the lower whisker of your boxplot (shown by boxplot stats). The lowest value in your dataset that is equal to or greater than 45 is 56, which becomes the extent of the lower whisker of your boxplot (anything lower is an outlier). If you had had another value in your dataset between 45 and 56, that would be the extent of the lower whisker of your boxplot.

Likewise, if you did not have the value of 173 in your dataset, the upper whisker value would change as well, though the threshold for an outlier would not (provided the IQR and quartiles didn't change).
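The distinction between the outlier threshold (the fence) and the whisker end can be sketched with a small made-up vector. The data and variable names below are purely illustrative and are not taken from the question; the sketch assumes base R's `boxplot.stats()`, which `boxplot()` calls to compute the values it draws.

```r
# Hypothetical toy data: one very low and one very high value
x <- c(0, 56, 90, 93, 100, 107, 110, 120, 125, 173)

hinges <- fivenum(x)[c(2, 4)]      # lower and upper hinge (box edges): 90, 120
fence  <- 1.5 * diff(hinges)       # 1.5 times the box length: 45

lower_fence <- hinges[1] - fence   # 45: outlier threshold, not the whisker
upper_fence <- hinges[2] + fence   # 165

# Whiskers end at the most extreme data points that lie inside the fences
lower_whisker <- min(x[x >= lower_fence])   # 56
upper_whisker <- max(x[x <= upper_fence])   # 125

stats <- boxplot.stats(x)
stats$stats   # whisker, hinge, median, hinge, whisker: 56, 90, 103.5, 120, 125
stats$out     # points beyond the fences: 0 and 173
```

Here the lower fence is 45, but no observation equals 45, so the whisker is drawn at 56, the smallest value that is not an outlier.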

Upvotes: 0

RLave

Reputation: 8364

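Your calculation was almost right. R uses something close to this:

#max(min(x), Q1 - (IQR(x)*1.5)) #lower whisker
#min(max(x), Q3 + (IQR(x)*1.5)) #upper whisker

That is, each whisker is capped at min(x) or max(x) when the standard formula would extend past the data. Note that this reproduces the boxplot values only when no point lies beyond the fence; when there are outliers, the whisker stops at the most extreme data point inside the fence rather than at the fence itself.

Here is an example: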

my_data <- mtcars$mpg

bp <- boxplot(my_data)
bp$stats
# [1,] 10.40 # lower whisker
# [2,] 15.35
# [3,] 19.20 # == median(my_data)
# [4,] 22.80
# [5,] 33.90 # upper whisker


# boxplot() takes the box edges from Tukey's hinges (see fivenum()), not quantile()
hinges <- fivenum(my_data)[c(2, 4)]
fence <- 1.5 * diff(hinges)
max(min(my_data, na.rm = TRUE), hinges[1] - fence)
#[1] 10.4 # lower whisker
min(max(my_data, na.rm = TRUE), hinges[2] + fence)
#[1] 33.9 # upper whisker
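One caveat is worth checking on the same `mtcars$mpg` vector. As documented for `boxplot.stats()`, the box edges are hinges from `fivenum()`, which can differ slightly from the type-7 quartiles that `quantile()` and `IQR()` return by default. The sketch below compares the two; the variable names are illustrative only.

```r
my_data <- mtcars$mpg

quartiles <- unname(quantile(my_data, c(0.25, 0.75)))  # default type-7 quantiles
hinges    <- fivenum(my_data)[c(2, 4)]                  # box edges used by boxplot()

# Upper fence under each definition of the box
fence_quantile <- quartiles[2] + 1.5 * diff(quartiles)
fence_hinge    <- hinges[2] + 1.5 * diff(hinges)

# Whisker ends as computed by boxplot.stats()
whiskers <- boxplot.stats(my_data)$stats[c(1, 5)]

print(quartiles)       # 15.425 22.800
print(hinges)          # 15.35 22.80
print(fence_quantile)  # 33.8625
print(fence_hinge)     # 33.975
print(whiskers)        # 10.4 33.9
```

With quantile-based quartiles the maximum of 33.9 would fall just above the fence, whereas with hinges it falls inside, which is consistent with the upper whisker of 33.9 reported in `bp$stats`.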

Upvotes: 5

paoloeusebi

Reputation: 1086

I think there are a few things to clarify. The first is that you should always provide a reproducible example so that people can help you. An outlier is defined as a data point located outside the whiskers of the boxplot (i.e., more than 1.5 times the interquartile range above the upper quartile or below the lower quartile). A good way to see how this works is to simulate some Student's t data under a pre-specified random number generator state.

set.seed(1)
mydata <- rt(100, df = 3)
boxplot(mydata)
summary(mydata)

Then we can calculate the interquartile range and the lower and upper bounds for outliers according to the rule in the text above:

# summary() returns Min, 1st Qu., Median, Mean, 3rd Qu., Max in that order
s <- as.vector(summary(mydata))   # named s to avoid masking base::t()
iqr.range <- s[5]-s[2]
upper_outliers <- s[5]+iqr.range*1.5
lower_outliers <- s[2]-iqr.range*1.5

Let's check which data points are flagged as outliers. The boxplot whiskers end at the most extreme data points that still fall inside the lower/upper boundaries.

 mydata[mydata<lower_outliers]
 [1] -3.527006 -2.959327 -2.754192
 mydata[mydata>upper_outliers]
 [1] 3.080302 3.527205
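As a cross-check, the same simulated data can be passed to `boxplot.stats()`, which returns both the whisker ends and the points drawn as outliers. This is only a sketch: the helper variables are illustrative, and because `boxplot.stats()` works from hinges rather than `summary()` quartiles, its fences can differ slightly from the ones computed above.

```r
set.seed(1)
mydata <- rt(100, df = 3)

bs <- boxplot.stats(mydata)   # coef = 1.5 by default

# Fences implied by the hinges that boxplot.stats() uses for the box
hinge_iqr   <- bs$stats[4] - bs$stats[2]
lower_fence <- bs$stats[2] - 1.5 * hinge_iqr
upper_fence <- bs$stats[4] + 1.5 * hinge_iqr

# Whisker ends are the most extreme observations within the fences
inside <- mydata[mydata >= lower_fence & mydata <= upper_fence]
range(inside)   # should equal bs$stats[c(1, 5)]
sort(bs$out)    # the points plotted individually as outliers
```

Comparing `range(inside)` with `bs$stats[c(1, 5)]` makes it clear that the whiskers are actual observations, not the fence values themselves.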

Upvotes: 1
