Reputation: 46
I have a function, detectaOutliers(), that is supposed to delete outliers, but somehow it does not remove all of them.
Can somebody help me find the mistake?
detectaOutliers = function(x) {
  q = quantile(x, probs = c(0.25, 0.75))
  R = IQR(x)
  OM1 = q[1] - (R * 1.5) # lower fence, moderate outliers
  OM3 = q[2] + (R * 1.5) # upper fence, moderate outliers
  OE1 = q[1] - (R * 3) # lower fence, extreme outliers
  OE3 = q[2] + (R * 3) # upper fence, extreme outliers
  moderados = ifelse(x < OM1 | x > OM3, 1, 0)
  extremos = ifelse(x < OE1 | x > OE3, 1, 0) # computed but not returned
  cbind(extOut = moderados) # 1 = moderate outlier, 0 = keep
}
cepas = unique(AbsExtSin$Cepa)
concs = unique(AbsExtSin$Concen)
outliers = NULL
# flag outliers separately within each Cepa x Concen group
for (cepa in cepas) {
  for (concen in concs) {
    datosOE = subset(AbsExtSin, Cepa == cepa & Concen == concen)
    outs = detectaOutliers(datosOE$Abs)
    datosOE = cbind(datosOE, outs)
    outliers = rbind(outliers, datosOE)
  }
}
# keep the non-outlier rows and the first five columns
AbsExtSin = subset(outliers, extOut == 0)[, 1:5]
This is the data after removing outliers (I deleted 11, but more remain).
Upvotes: 1
Views: 857
Reputation: 46
Six hours later, I realized that the error was in the variables I was using. My data set has 4 variables, and I needed to remove the outliers of a single column within groups defined by two of the others; it turns out I had chosen the wrong two. Once I fixed that, the function worked perfectly!
Sorry for the inconvenience, and thank you all very much.
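For reference, flagging outliers of one column within groups defined by two others can also be sketched without nested loops, using ave(). This is only a minimal sketch on made-up data: the data frame `toy` and the helper `flag_outliers` are hypothetical, and only the column names Cepa, Concen and Abs are taken from the question.

```r
# Hypothetical toy data; only the column names come from the question.
toy <- data.frame(
  Cepa   = rep(c("A", "B"), each = 12),
  Concen = rep(rep(c(1, 2), each = 6), times = 2),
  Abs    = c(1.0, 1.1, 0.9, 1.2, 1.0, 9.0,   # 9.0 is far from its group
             2.0, 2.1, 1.9, 2.2, 2.0, 2.1,
             3.0, 3.1, 2.9, 3.2, 3.0, 3.1,
             4.0, 4.1, 3.9, 4.2, 4.0, -5.0)  # -5.0 is far from its group
)

# Hypothetical helper: 1 if a value lies outside the 1.5 * IQR fences, else 0.
flag_outliers <- function(v) {
  q <- quantile(v, probs = c(0.25, 0.75), names = FALSE)
  r <- q[2] - q[1]
  as.numeric(v < q[1] - 1.5 * r | v > q[2] + 1.5 * r)
}

# ave() applies the helper separately within each Cepa x Concen group
toy$extOut <- ave(toy$Abs, toy$Cepa, toy$Concen, FUN = flag_outliers)

# keep only the rows that were not flagged
toy_clean <- subset(toy, extOut == 0)
```

The grouping columns passed to ave() are the part that matters here: with the wrong pair of columns, the fences are computed over the wrong subsets and different rows get flagged.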
Upvotes: 0
Reputation: 123
Answer: I assume that your problem is the following: first, you detect outliers (the same way the boxplot function does) and remove them. Afterwards, you draw boxplots of the cleaned data, which again show outliers, although you expected to see none.
This is not necessarily an error in your code; it is an error in your expectations. When you remove the outliers, the statistics of your data set change. For example, the quartiles are no longer the same, so the fences move and values that used to lie inside them can be identified as "new" outliers. See the following example:
## create example data
set.seed(12345)
rand <- rexp(100,23)
## plot. gives outliers.
boxplot(rand)
## detect outliers with these functions
detectaOutliers = function(x) {
  q = quantile(x, probs = c(0.25, 0.75))
  R = IQR(x)
  OM1 = q[1] - (R * 1.5) # lower fence, moderate outliers
  OM3 = q[2] + (R * 1.5) # upper fence, moderate outliers
  OE1 = q[1] - (R * 3) # lower fence, extreme outliers
  OE3 = q[2] + (R * 3) # upper fence, extreme outliers
  moderados = ifelse(x < OM1 | x > OM3, 1, 0)
  extremos = ifelse(x < OE1 | x > OE3, 1, 0) # computed but not returned
  cbind(extOut = moderados) # 1 = moderate outlier, 0 = keep
}
detectOut <- function(x) boxplot(x, plot = FALSE)$out
## clean your data
clean1 <- rand[!as.logical(detectaOutliers(rand))]
clean2 <- rand[!rand %in% detectOut(rand)]
## check that these functions do the same.
## identical() also copes with results of different lengths.
identical(clean1, clean2)
# Fun fact: depending on your data, clean1 and clean2
# are not always the same. See Note 1 below.
## plot cleaned data
boxplot(clean2)
## Still shows outliers, but "new" ones. Confirm with:
sort(boxplot(rand)$out) # original outliers
sort(boxplot(clean2)$out) # new outliers
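To make the shift visible on a small scale, the fences can be computed before and after cleaning. The following is a minimal sketch, assuming the same 1.5 * IQR rule based on quantile(); the helper `fences` and the vector `x` are illustrative and not part of the code above.

```r
# Hypothetical helper: lower and upper 1.5 * IQR fences of a vector.
fences <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  r <- q[2] - q[1]
  c(lower = q[1] - 1.5 * r, upper = q[2] + 1.5 * r)
}

x <- c(1, 2, 3, 4, 5, 6, 7, 8, 14, 30)

# first pass: drop everything outside the fences of the raw data
f1 <- fences(x)
clean <- x[x >= f1[["lower"]] & x <= f1[["upper"]]]

# second pass: recompute the fences on the cleaned data
f2 <- fences(clean)
new_out <- clean[clean < f2[["lower"]] | clean > f2[["upper"]]]

f1
f2
new_out
```

With these numbers the upper fence drops from 14.5 to 13 once 30 is removed, so 14 is flagged only in the second pass even though it was inside the fences of the original data.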
Note 1: Your code does not necessarily identify the same outliers as the boxplot function in R (this is true at least for graphics::boxplot; I am not sure about the ggplot2 boxplot):
## The boxplot function (rather: boxplot.stats)
## does not use the quantile function, but the fivenum function
## to identify outliers. They produce different results, e.g., here:
fivenum(rand)[c(2,4)]
quantile(rand,probs=c(0.25,0.75))
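If the goal is to flag exactly the points that graphics::boxplot draws as outliers, the detection can be based on the hinges instead of the quantiles. A minimal sketch, assuming data without missing values; `flag_hinge_outliers` is a hypothetical helper name, and `coef = 1.5` matches the default of boxplot.stats.

```r
# Hypothetical helper mirroring the rule used by boxplot.stats():
# hinges come from fivenum(), fences sit coef * (hinge spread) beyond them.
flag_hinge_outliers <- function(x, coef = 1.5) {
  h <- fivenum(x)[c(2, 4)]   # lower and upper hinge
  spread <- diff(h)
  x < h[1] - coef * spread | x > h[2] + coef * spread
}

set.seed(12345)
rand <- rexp(100, 23)

flags <- flag_hinge_outliers(rand)

# compare with the outliers reported by boxplot.stats()
identical(rand[flags], boxplot.stats(rand)$out)
```

Cleaning with `rand[!flags]` then removes the same points the boxplot marks, which is not guaranteed for the quantile-based version.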
Note 2:
If you want boxplots that do not draw outliers, you can set the outline
parameter of the boxplot function to FALSE (for ggplot, see "Ignore outliers in ggplot2 boxplot")
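A short sketch of the outline parameter, using a made-up vector `x` for illustration:

```r
x <- c(1:10, 100)   # 100 lies far beyond the upper whisker

# outline = FALSE only suppresses drawing of the outlier points
res <- boxplot(x, outline = FALSE)

# the outlier is still listed in the returned statistics
res$out
```

Note that this only changes the plot; the data themselves are left untouched, so no observations are removed.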
Upvotes: 2