kaksat
kaksat

Reputation: 729

Function: sapply in apply, removing outliers

I'm working on a function which will get rid of outliers in a given data set based on 3 sigma rule. My code is presented below. "data" is a data set to be processed.

rm.outlier <- function(data){

  apply(data, 2, function(var) {
      sigma3.plus <- mean(var) + 3 * sd(var) 
      sigma3.min <- mean(var) - 3 * sd(var)
      sapply(var, function(y) {
        if (y > sigma3.plus){
          y <- sigma3.plus
        } else if (y < sigma3.min){
          y <- sigma3.min
        } else {y <- y}
      })
    })
    as.data.frame(data)
}

In order to check if the function works I wrote a short test:

set.seed(123)
a <- data.frame("var1" = rnorm(10000, 0, 1))
b <- a
sum(a$var1 > mean(a$var1) + 3 * sd(a$var1)) # number of outliers in a

As a result, I get:

[1] 12

So the variable var1 in the data frame a has 12 outliers. Next, I try to apply my function on this object:

a2 <- rm.outlier(a)
sum(b$var1 - a2$var1)

Unfortunately, it gives 0 which clearly indicates that something does not work. I have already worked out that the implementation of sapply is correct so there must be a mistake in my apply. Any help would be appreciated.

Upvotes: 2

Views: 931

Answers (2)

sgibb
sgibb

Reputation: 25736

If runtime is important for you, you might consider another approach. You could vectorize this filtering, e.g. by using pmin and pmax which is equally readable and > 15x times faster. If you like it a little bit more complex you could use findInterval and get even more speed:

rm.outlier2 <- function(x) {
  ## calculate -3/3 * sigma borders
  s <- mean(x) + c(-3, 3) * sd(x)
  pmin(pmax(x, s[1]), s[2])
}

rm.outlier3 <- function(x) {
  ## calculate -3/3 * sigma borders
  s <- mean(x) + c(-3, 3) * sd(x)
  ## sorts x into intervals 0 == left of s[1], 2 == right of s[2], 1
  ## between both s
  i <- findInterval(x, s)
  ## which values are left/right of the interval
  j <- which(i != 1L)
  ## add a value between s to directly use output of findInterval for subsetting
  s2 <- c(s[1], 0, s[2])
  ## replace all values that are left/right of the interval
  x[j] <- s2[i[j] + 1L]
  x
}

Benchmarking the stuff:

## slightly modified OP version
rm.outlier <- function(x) {
  sigma3 <- mean(x) + c(-3,3) * sd(x)
  sapply(x, function(y) {
    if (y > sigma3[2]){
      y <- sigma3[2]
    } else if (y < sigma3[1]){
      y <- sigma3[1]
    } else {y <- y}
  })
}

set.seed(123)
a <- rnorm(10000, 0, 1)

# check output
all.equal(rm.outlier(a), rm.outlier2(a))
all.equal(rm.outlier2(a), rm.outlier3(a))

library("rbenchmark")

benchmark(rm.outlier(a), rm.outlier2(a), rm.outlier3(a),
          order = "relative",
          columns = c("test", "replications", "elapsed", "relative"))
#            test replications elapsed relative
#3 rm.outlier3(a)          100   0.028    1.000
#2 rm.outlier2(a)          100   0.102    3.643
#1  rm.outlier(a)          100   1.825   65.179

Upvotes: 3

Joachim Schork
Joachim Schork

Reputation: 2147

It seems like you just forgot to assign your results of the apply function to a new dataframe. (Compare the 3rd line with your code)

rm.outlier <- function(data){

  # Assign the result to a new dataframe
  data_new <- apply(data, 2, function(var) {
    sigma3.plus <- mean(var) + 3 * sd(var) 
    sigma3.min <- mean(var) - 3 * sd(var)
    sapply(var, function(y) {
      if (y > sigma3.plus){
        y <- sigma3.plus
      } else if (y < sigma3.min){
        y <- sigma3.min
      } else {y <- y}
    })
  })

  # Print the new dataframe
  as.data.frame(data_new)
}

set.seed(123)
a <- data.frame("var1" = rnorm(10000, 0, 1))
sum(a$var1 > mean(a$var1) + 3 * sd(a$var1)) # number of too big outliers
# 15
sum(a$var1 < mean(a$var1) - 3 * sd(a$var1)) # number of too small outliers
# 13
# Overall 28 outliers

# Check the function for the number of outliers
a2 <- rm.outlier(a)
sum(a2$var1 == a$var1) - length(a$var1)

Upvotes: 1

Related Questions