Reputation: 61
> summary(mydata)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      0      93     107     110     125     197
> range=1.5*(125-93)
> upper_whisker=125+range
> lower_whisker=93-range
> upper_whisker
[1] 173
> lower_whisker
[1] 45
> boxplot(mydata)$stats
[,1]
[1,] 56 #Lower whisker by boxplot
[2,] 93
[3,] 107
[4,] 125
[5,] 173
I looked up the formula for the thresholds beyond which points are considered outliers. It was:
Above =>3rd Qu +(3rd Qu - 1st Qu)*1.5
Below =>1st Qu -(3rd Qu - 1st Qu)*1.5
For some reason the lower value I calculated (45) doesn't match the stats returned by the boxplot function in R (56). I have a feeling I'm missing something silly here.
Are they calculated differently? Or am I reading the wrong answer from boxplot?
Edit:
I've used https://www.kaggle.com/uciml/pima-indians-diabetes-database and ran
mydata=raw$Glucose[raw$Outcome==0]
I suppose that if
#max(min(x), Q1 - (IQR(x)*1.5)) #lower whisker
returns min(x), there shouldn't be any outliers, and min(mydata) is 0.
Edit 3: Clearer view of Quantile
quantile(mydata)
  0%  25%  50%  75% 100%
   0   93  107  125  197
Edit 4: Added vector as requested
c(85L, 89L, 116L, 115L, 110L, 139L, 103L, 126L, 99L, 97L, 145L,
117L, 109L, 88L, 92L, 122L, 103L, 138L, 180L, 133L, 106L, 159L,
146L, 71L, 105L, 103L, 101L, 88L, 150L, 73L, 100L, 146L, 105L,
84L, 44L, 141L, 99L, 109L, 95L, 146L, 139L, 129L, 79L, 0L, 62L,
95L, 112L, 113L, 74L, 83L, 101L, 110L, 106L, 100L, 107L, 80L,
123L, 81L, 142L, 144L, 92L, 71L, 93L, 151L, 125L, 81L, 85L, 126L,
96L, 144L, 83L, 89L, 76L, 78L, 97L, 99L, 111L, 107L, 132L, 120L,
118L, 84L, 96L, 125L, 100L, 93L, 129L, 105L, 128L, 106L, 108L,
154L, 102L, 57L, 106L, 147L, 90L, 136L, 114L, 153L, 99L, 109L,
88L, 151L, 102L, 114L, 100L, 148L, 120L, 110L, 111L, 87L, 79L,
75L, 85L, 143L, 87L, 119L, 0L, 73L, 141L, 111L, 123L, 85L, 105L,
113L, 138L, 108L, 99L, 103L, 111L, 96L, 81L, 147L, 179L, 125L,
119L, 142L, 100L, 87L, 101L, 197L, 117L, 79L, 122L, 74L, 104L,
91L, 91L, 146L, 122L, 165L, 124L, 111L, 106L, 129L, 90L, 86L,
111L, 114L, 193L, 191L, 95L, 142L, 96L, 128L, 102L, 108L, 122L,
71L, 106L, 100L, 104L, 114L, 108L, 129L, 133L, 136L, 155L, 96L,
108L, 78L, 161L, 151L, 126L, 112L, 77L, 150L, 120L, 137L, 80L,
106L, 113L, 112L, 99L, 115L, 129L, 112L, 157L, 179L, 105L, 118L,
87L, 106L, 95L, 165L, 117L, 130L, 95L, 0L, 122L, 95L, 126L, 139L,
116L, 99L, 92L, 137L, 61L, 90L, 90L, 88L, 158L, 103L, 147L, 99L,
101L, 81L, 118L, 84L, 105L, 122L, 98L, 87L, 93L, 107L, 105L,
109L, 90L, 125L, 119L, 100L, 100L, 131L, 116L, 127L, 96L, 82L,
137L, 72L, 123L, 101L, 102L, 112L, 143L, 143L, 97L, 83L, 119L,
94L, 102L, 115L, 94L, 135L, 99L, 89L, 80L, 139L, 90L, 140L, 147L,
97L, 107L, 83L, 117L, 100L, 95L, 120L, 82L, 91L, 119L, 100L,
135L, 86L, 134L, 120L, 71L, 74L, 88L, 115L, 124L, 74L, 97L, 154L,
144L, 137L, 119L, 136L, 114L, 137L, 114L, 126L, 132L, 123L, 85L,
84L, 139L, 173L, 99L, 194L, 83L, 89L, 99L, 80L, 166L, 110L, 81L,
154L, 117L, 84L, 94L, 96L, 75L, 130L, 84L, 120L, 139L, 91L, 91L,
99L, 125L, 76L, 129L, 68L, 124L, 114L, 125L, 87L, 97L, 116L,
117L, 111L, 122L, 107L, 86L, 91L, 77L, 105L, 57L, 127L, 84L,
88L, 131L, 164L, 189L, 116L, 84L, 114L, 88L, 84L, 124L, 97L,
110L, 103L, 85L, 87L, 99L, 91L, 95L, 99L, 92L, 154L, 78L, 130L,
111L, 98L, 143L, 119L, 108L, 133L, 109L, 121L, 100L, 93L, 103L,
73L, 112L, 82L, 123L, 67L, 89L, 109L, 108L, 96L, 124L, 124L,
92L, 152L, 111L, 106L, 105L, 106L, 117L, 68L, 112L, 92L, 183L,
94L, 108L, 90L, 125L, 132L, 128L, 94L, 102L, 111L, 128L, 92L,
104L, 94L, 100L, 102L, 128L, 90L, 103L, 157L, 107L, 91L, 117L,
123L, 120L, 106L, 101L, 120L, 127L, 162L, 112L, 98L, 154L, 165L,
99L, 68L, 123L, 91L, 93L, 101L, 56L, 95L, 136L, 129L, 130L, 107L,
140L, 107L, 121L, 90L, 99L, 127L, 118L, 122L, 129L, 110L, 80L,
127L, 158L, 126L, 134L, 102L, 94L, 108L, 83L, 114L, 117L, 111L,
112L, 116L, 141L, 175L, 92L, 106L, 105L, 95L, 126L, 65L, 99L,
102L, 109L, 153L, 100L, 81L, 121L, 108L, 137L, 106L, 88L, 89L,
101L, 122L, 121L, 93L)
Upvotes: 4
Views: 5609
Reputation: 1
I realize this is very old, but just in case:
The calculated lower limit where anything less is an outlier is 45, which is different from the extent of the lower whisker of your boxplot (shown by boxplot stats). The lowest value in your dataset that is equal to or greater than 45 is 56, which becomes the extent of the lower whisker of your boxplot (anything lower is an outlier). If you had had another value in your dataset between 45 and 56, that would be the extent of the lower whisker of your boxplot.
Likewise, if you did not have the value of 173 in your dataset, the upper whisker value would change as well, though the threshold for an outlier would not (provided the IQR and quartiles didn't change).
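The difference between the outlier threshold and the whisker can be sketched with a small made-up vector. This is not the asker's data: the nine values are hypothetical, chosen only so that the quartiles come out as 93, 107, and 125 as in the question, and the variable names are illustrative.

```r
# Hypothetical 9-point vector (an assumption for illustration, not the real data).
# With n = 9 the hinges used by boxplot() coincide with the quartiles: 93, 107, 125.
x <- c(0, 56, 93, 100, 107, 115, 125, 173, 197)

q1  <- quantile(x, 0.25, names = FALSE)
q3  <- quantile(x, 0.75, names = FALSE)
iqr <- q3 - q1

# Thresholds beyond which a point counts as an outlier
lower_fence <- q1 - 1.5 * iqr   # 45
upper_fence <- q3 + 1.5 * iqr   # 173

# Whiskers: the most extreme data points that still lie within the thresholds
inside <- x[x >= lower_fence & x <= upper_fence]
lower_whisker <- min(inside)    # 56
upper_whisker <- max(inside)    # 173

c(lower_fence = lower_fence, lower_whisker = lower_whisker)
c(upper_fence = upper_fence, upper_whisker = upper_whisker)
boxplot.stats(x)$stats          # whisker ends are the first and last entries
```

Here 0 and 197 fall outside the thresholds, so they are drawn as individual points and the whiskers stop at 56 and 173.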
Upvotes: 0
Reputation: 8364
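Your calculation was almost right. The whiskers never extend past the data, which amounts to this:
#max(min(x), Q1 - (IQR(x)*1.5)) #lower whisker
#min(max(x), Q3 + (IQR(x)*1.5)) #upper whisker
So it picks the max/min between min(x)/max(x) and the standard formula. When some points do fall beyond the limit given by the standard formula, the whisker ends at the most extreme data point inside that limit rather than at the limit itself.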
Here an example:
my_data <- mtcars$mpg
bp <- boxplot(my_data)
bp$stats
# [1,] 10.40 # lower whisker
# [2,] 15.35
# [3,] 19.20 # == median(my_data)
# [4,] 22.80
# [5,] 33.90 # upper whisker
h <- fivenum(my_data) # boxplot() uses Tukey's hinges: h[2] and h[4] are the box edges
max(min(my_data, na.rm = TRUE), h[2] - (h[4] - h[2]) * 1.5)
#[1] 10.4 # lower whisker
min(max(my_data, na.rm = TRUE), h[4] + (h[4] - h[2]) * 1.5)
#[1] 33.9 # upper whisker
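One detail worth checking, going by the `boxplot.stats` documentation: `boxplot` builds its box from Tukey's hinges (`fivenum`), which can differ slightly from the default quartiles returned by `quantile()`. A minimal sketch on the same `mtcars$mpg` data, with illustrative variable names:

```r
my_data <- mtcars$mpg

hinges <- fivenum(my_data)[c(2, 4)]                        # box edges used by boxplot()
quarts <- quantile(my_data, c(0.25, 0.75), names = FALSE)  # default quantile type 7

hinges  # 15.35 22.80
quarts  # 15.425 22.800

# Upper outlier threshold under each definition
hinges[2] + 1.5 * diff(hinges)  # 33.975, so max(my_data) = 33.9 stays inside
quarts[2] + 1.5 * diff(quarts)  # 33.8625, so 33.9 would fall outside
```

With larger samples the two definitions usually agree closely, but on small data the choice can decide whether a borderline point is flagged.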
Upvotes: 5
Reputation: 1086
I think there are a few things to clarify. First, you should always provide a reproducible example to help people help you. An outlier is defined as a data point located outside the whiskers of the boxplot (i.e., more than 1.5 times the interquartile range above the upper quartile or below the lower quartile). A good way to see how this works is to simulate some Student's t data under a fixed random number generator seed.
set.seed(1)
mydata <- rt(100, df = 3)
boxplot(mydata)
summary(mydata)
Then we can calculate the interquartile range and the lower and upper bounds for outliers according to the rule above:
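# quantile() avoids relying on summary() output and does not mask base::t()
q <- quantile(mydata, c(0.25, 0.75), names = FALSE)
iqr.range <- q[2] - q[1]
upper_outliers <- q[2] + iqr.range * 1.5
lower_outliers <- q[1] - iqr.range * 1.5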
Let's check which data points are classified as outliers. The boxplot whiskers end at the data points lying just inside the lower and upper bounds.
mydata[mydata<lower_outliers]
[1] -3.527006 -2.959327 -2.754192
mydata[mydata>upper_outliers]
[1] 3.080302 3.527205
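As a cross-check, `boxplot.stats` returns both the whisker ends and the flagged points directly. The sketch below assumes the same seed and takes the box edges from `fivenum` (the hinges that `boxplot` draws), so its thresholds may differ marginally from quartile-based ones. The names `h`, `lower_fence`, `upper_fence`, `inliers`, and `bs` are illustrative.

```r
set.seed(1)
mydata <- rt(100, df = 3)

h <- fivenum(mydata)   # min, lower hinge, median, upper hinge, max
lower_fence <- h[2] - 1.5 * (h[4] - h[2])
upper_fence <- h[4] + 1.5 * (h[4] - h[2])

# Whiskers are the extremes of the points that lie within the thresholds
inliers <- mydata[mydata >= lower_fence & mydata <= upper_fence]
range(inliers)

bs <- boxplot.stats(mydata)
bs$stats[c(1, 5)]      # same two values as range(inliers)
bs$out                 # points drawn individually beyond the whiskers
```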
Upvotes: 1