Reputation: 729
I'm working on a function which will get rid of outliers in a given data set based on 3 sigma rule. My code is presented below. "data" is a data set to be processed.
rm.outlier <- function(data){
apply(data, 2, function(var) {
sigma3.plus <- mean(var) + 3 * sd(var)
sigma3.min <- mean(var) - 3 * sd(var)
sapply(var, function(y) {
if (y > sigma3.plus){
y <- sigma3.plus
} else if (y < sigma3.min){
y <- sigma3.min
} else {y <- y}
})
})
as.data.frame(data)
}
In order to check if the function works I wrote a short test:
set.seed(123)
a <- data.frame("var1" = rnorm(10000, 0, 1))
b <- a
sum(a$var1 > mean(a$var1) + 3 * sd(a$var1)) # number of outliers in a
As a result, I get:
[1] 12
So the variable var1 in the data frame a has 12 outliers. Next, I try to apply my function on this object:
a2 <- rm.outlier(a)
sum(b$var1 - a2$var1)
Unfortunately, it gives 0 which clearly indicates that something does not work. I have already worked out that the implementation of sapply is correct so there must be a mistake in my apply. Any help would be appreciated.
Upvotes: 2
Views: 931
Reputation: 25736
If runtime is important for you, you might consider another approach. You could vectorize this filtering, e.g. by using pmin
and pmax
which is equally readable and > 15x times faster. If you like it a little bit more complex you could use findInterval
and get even more speed:
rm.outlier2 <- function(x) {
## calculate -3/3 * sigma borders
s <- mean(x) + c(-3, 3) * sd(x)
pmin(pmax(x, s[1]), s[2])
}
rm.outlier3 <- function(x) {
## calculate -3/3 * sigma borders
s <- mean(x) + c(-3, 3) * sd(x)
## sorts x into intervals 0 == left of s[1], 2 == right of s[2], 1
## between both s
i <- findInterval(x, s)
## which values are left/right of the interval
j <- which(i != 1L)
## add a value between s to directly use output of findInterval for subsetting
s2 <- c(s[1], 0, s[2])
## replace all values that are left/right of the interval
x[j] <- s2[i[j] + 1L]
x
}
Benchmarking the stuff:
## slightly modified OP version
rm.outlier <- function(x) {
sigma3 <- mean(x) + c(-3,3) * sd(x)
sapply(x, function(y) {
if (y > sigma3[2]){
y <- sigma3[2]
} else if (y < sigma3[1]){
y <- sigma3[1]
} else {y <- y}
})
}
set.seed(123)
a <- rnorm(10000, 0, 1)
# check output
all.equal(rm.outlier(a), rm.outlier2(a))
all.equal(rm.outlier2(a), rm.outlier3(a))
library("rbenchmark")
benchmark(rm.outlier(a), rm.outlier2(a), rm.outlier3(a),
order = "relative",
columns = c("test", "replications", "elapsed", "relative"))
# test replications elapsed relative
#3 rm.outlier3(a) 100 0.028 1.000
#2 rm.outlier2(a) 100 0.102 3.643
#1 rm.outlier(a) 100 1.825 65.179
Upvotes: 3
Reputation: 2147
It seems like you just forgot to assign your results of the apply function to a new dataframe. (Compare the 3rd line with your code)
rm.outlier <- function(data){
# Assign the result to a new dataframe
data_new <- apply(data, 2, function(var) {
sigma3.plus <- mean(var) + 3 * sd(var)
sigma3.min <- mean(var) - 3 * sd(var)
sapply(var, function(y) {
if (y > sigma3.plus){
y <- sigma3.plus
} else if (y < sigma3.min){
y <- sigma3.min
} else {y <- y}
})
})
# Print the new dataframe
as.data.frame(data_new)
}
set.seed(123)
a <- data.frame("var1" = rnorm(10000, 0, 1))
sum(a$var1 > mean(a$var1) + 3 * sd(a$var1)) # number of too big outliers
# 15
sum(a$var1 < mean(a$var1) - 3 * sd(a$var1)) # number of too small outliers
# 13
# Overall 28 outliers
# Check the function for the number of outliers
a2 <- rm.outlier(a)
sum(a2$var1 == a$var1) - length(a$var1)
Upvotes: 1