Function: sapply in apply, removing outliers

Question

I'm working on a function that removes outliers from a given data set based on the 3-sigma rule: values more than three standard deviations from the mean are replaced by the corresponding bound. My code is shown below; "data" is the data set to be processed.

rm.outlier <- function(data){

  apply(data, 2, function(var) {
      sigma3.plus <- mean(var) + 3 * sd(var) 
      sigma3.min <- mean(var) - 3 * sd(var)
      sapply(var, function(y) {
        if (y > sigma3.plus){
          y <- sigma3.plus
        } else if (y < sigma3.min){
          y <- sigma3.min
        } else {y <- y}
      })
    })
    as.data.frame(data)
}

In order to check if the function works I wrote a short test:

set.seed(123)
a <- data.frame("var1" = rnorm(10000, 0, 1))
b <- a
sum(a$var1 > mean(a$var1) + 3 * sd(a$var1)) # number of values above the upper 3-sigma bound in a

As a result, I get:

[1] 12

So the variable var1 in the data frame a has 12 outliers. Next, I try to apply my function on this object:

a2 <- rm.outlier(a)
sum(b$var1 - a2$var1)

Unfortunately, this gives 0, which means no value was changed, so something is not working. I have already checked that the sapply part is correct, so the mistake must be in how I use apply. Any help would be appreciated.

Joachim Schork · Accepted Answer

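It looks like you just forgot to assign the result of apply() to an object. apply() returns a new matrix and leaves data untouched, so your function ends by returning the original data unchanged. (Compare the line that creates data_new with your code.)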

rm.outlier <- function(data){

  # Assign the result to a new dataframe
  data_new <- apply(data, 2, function(var) {
    sigma3.plus <- mean(var) + 3 * sd(var) 
    sigma3.min <- mean(var) - 3 * sd(var)
    sapply(var, function(y) {
      if (y > sigma3.plus){
        y <- sigma3.plus
      } else if (y < sigma3.min){
        y <- sigma3.min
      } else {y <- y}
    })
  })

  # Return the result as a data frame
  as.data.frame(data_new)
}

set.seed(123)
a <- data.frame("var1" = rnorm(10000, 0, 1))
sum(a$var1 > mean(a$var1) + 3 * sd(a$var1)) # number of too big outliers
# 15
sum(a$var1 < mean(a$var1) - 3 * sd(a$var1)) # number of too small outliers
# 13
# Overall 28 outliers

# Check the function for the number of outliers
a2 <- rm.outlier(a)
sum(a2$var1 != a$var1) # number of values that were capped; should equal the number of outliers
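
As a side note, the element-wise sapply is not strictly needed, because the capping can be done on whole columns at once. The following is only a sketch of that idea, not part of the original answer: the name cap_outliers and the parameter k are made up for illustration, and skipping non-numeric columns and using na.rm = TRUE are assumptions about what you might want.

```r
# Hypothetical helper (name and arguments are illustrative):
# caps every numeric column at mean +/- k * sd, vectorised per column.
cap_outliers <- function(data, k = 3) {
  # data[] <- lapply(...) replaces the columns but keeps the data frame structure
  data[] <- lapply(data, function(x) {
    if (!is.numeric(x)) return(x)      # assumption: leave non-numeric columns alone
    m <- mean(x, na.rm = TRUE)
    s <- sd(x, na.rm = TRUE)
    # pmax raises values below the lower bound, pmin lowers values above the upper bound
    pmin(pmax(x, m - k * s), m + k * s)
  })
  data
}

set.seed(123)
a <- data.frame(var1 = rnorm(10000, 0, 1))
a3 <- cap_outliers(a)

upper <- mean(a$var1) + 3 * sd(a$var1)
lower <- mean(a$var1) - 3 * sd(a$var1)

# TRUE if exactly the values outside the bounds were changed
sum(a3$var1 != a$var1) == sum(a$var1 > upper | a$var1 < lower)
```

Unlike apply(), which first converts the data frame to a matrix, lapply() works column by column, so a data frame with mixed column types keeps its types.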
