Reputation: 23
I am trying to remove outliers in R in a very simple way. I know you can write your own functions for this, but I would like some input on this simple code and why it does not seem to work.
outliers <- boxplot(okt$pris)$out
okt_no_out <- okt[-c(outliers),]
boxplot(okt_no_out$pris)
In the first line I create a vector of the outliers; in the second I create a new data frame that omits the values in that vector. But when I check the new data frame, only about 400 of the outliers have been removed. The vector outliers contains roughly 750 values, so this removes only about half of them.
Shouldn't these few lines of code remove the outliers in a convenient way?
Upvotes: 2
Views: 12905
Reputation: 21
In your code, c(outliers) is the vector of outlier values, not their row numbers. When you put it inside [ ] with a minus sign, R treats those values as row positions, so it does not delete the rows that actually contain the outliers. On the other hand, -c(which(okt$pris %in% outliers)) refers to the row numbers of the rows that contain the outliers. Hope this helps!
# get the outlier values flagged by boxplot()
outliers <- boxplot(okt$pris)$out
# drop the rows whose pris value is one of those outliers
okt_no_out <- okt[-c(which(okt$pris %in% outliers)),]
# boxplot without those outliers
boxplot(okt_no_out$pris)
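One caveat with the negative which() form: if no outliers are found, which() returns an empty vector and the negative index then selects zero rows. A minimal sketch of a logical-index variant that avoids this, assuming a made-up data frame okt_demo (not the asker's data) and using boxplot.stats() so that no plot has to be drawn:

```r
# Hypothetical stand-in for okt; only the pris column matters here.
okt_demo <- data.frame(id = 1:8,
                       pris = c(3, 50, 52, 51, 49, 53, 50, 48))

# boxplot.stats() computes the same statistics as boxplot() without plotting.
outliers <- boxplot.stats(okt_demo$pris)$out

# Logical indexing: keep the rows whose pris is NOT an outlier value.
okt_demo_no_out <- okt_demo[!(okt_demo$pris %in% outliers), ]

# Data with no outliers at all, to show the difference between the two forms.
clean <- data.frame(id = 1:4, pris = c(50, 51, 52, 53))
none <- boxplot.stats(clean$pris)$out

rows_negative_which <- nrow(clean[-which(clean$pris %in% none), ])  # every row dropped
rows_logical <- nrow(clean[!(clean$pris %in% none), ])              # every row kept
```

Both forms give the same result whenever at least one outlier exists; they differ only in the empty case.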
Upvotes: 2
Reputation: 24069
boxplot(...)$out returns the values of the outliers, not their positions. So okt[-c(outliers),] treats those values as row numbers and removes essentially arbitrary rows from the data: some of them contain outliers and others do not.
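To make the difference between values and positions concrete, here is a small sketch on a made-up vector x (an assumption for illustration, not the asker's data), where the only outlier is the value 3 sitting at position 1:

```r
# Made-up data: the value 3 is far below the rest and sits at position 1.
x <- c(3, 50, 52, 51, 49, 53, 50, 48)

# Outlier VALUES, computed without drawing a plot.
out <- boxplot.stats(x)$out

# Using the value as a negative index drops position 3 (the value 52),
# so the outlier itself stays in the data.
wrong <- x[-out]

# Converting values to positions first drops the element that holds the outlier.
right <- x[-which(x %in% out)]
```

In the first case a perfectly ordinary observation is removed and the outlier survives, which matches the symptom in the question where only part of the outliers disappeared.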
What you can do is use the output from the boxplot's stats information to retrieve the end of the upper and lower whiskers and then filter your dataset using those values. See the example below:
# test data
testdata <- iris$Sepal.Width
# return the boxplot object
b <- boxplot(testdata)
# find the whisker ends from the boxplot's stats output
lowerwhisker <- b$stats[1]
upperwhisker <- b$stats[5]
# remove the extremes; the whisker ends are real, non-outlying data points,
# so keep them by using >= and <=
testdata <- testdata[testdata >= lowerwhisker & testdata <= upperwhisker]
# replot
b <- boxplot(testdata)
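The same whisker-based idea can be applied to a whole data frame, which is closer to the situation in the question. A minimal sketch, assuming it is acceptable to use boxplot.stats() in place of boxplot() (it returns the same stats without plotting); the names iris_no_out and keep are introduced here only for illustration:

```r
# Five-number summary used by the boxplot:
# lower whisker, lower hinge, median, upper hinge, upper whisker.
s <- boxplot.stats(iris$Sepal.Width)$stats

# Keep rows whose value lies within the whiskers, ends included.
keep <- iris$Sepal.Width >= s[1] & iris$Sepal.Width <= s[5]
iris_no_out <- iris[keep, ]

# Number of rows that were dropped.
n_removed <- nrow(iris) - nrow(iris_no_out)
```

Note that a boxplot of the filtered data recomputes the quartiles, so it may flag new points as outliers even though the original ones are gone.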
Upvotes: 4