datasci-iopsy

Reputation: 345

How to detect and replace outliers in multiple columns of a single data set using R?

I am trying to find and replace outliers from multiple numeric columns. This is not the best practice in my humble opinion, but it is something I'm attempting to figure out for specific use cases. A great example of creating an additional column that labels a row as an outlier can be found here but it is based on a single column.

My data looks as follows (for simplicity, I excluded columns with factors):

   Row ID   Value1 Value2
      1        6      1
      2        2     200
      3      100      3
      4        1      4
      5      250      5
      6        2      6
      7        8     300
      8      600     300
      9        2      9
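To make the example reproducible, the table above can be rebuilt as a data frame. This is only a sketch: the column names (in particular `RowID` without a space) are assumed. The last line lists what `boxplot.stats()` flags as outliers in each value column.

```r
# Rebuild the example data shown above (column names are assumed)
df <- data.frame(
  RowID  = 1:9,
  Value1 = c(6, 2, 100, 1, 250, 2, 8, 600, 2),
  Value2 = c(1, 200, 3, 4, 5, 6, 300, 300, 9)
)

# Values that boxplot.stats() flags as outliers in each value column
outs <- lapply(df[c("Value1", "Value2")], function(x) boxplot.stats(x)$out)
outs
```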

I used a function to replace outliers with NA in all numeric columns:

library(dplyr)
library(purrr)

# Replace boxplot outliers with NA in every numeric column
replaceOuts = function(df) {
    map_if(df, is.numeric, 
           ~ replace(.x, .x %in% boxplot.stats(.x)$out, NA)) %>% 
    bind_cols()
}
test = replaceOuts(df)

My question is how can I replace the outliers with another value (e.g., mean, median, capped value, etc.)? Any help would be appreciated!

Upvotes: 0

Views: 1136

Answers (2)

Ronak Shah

Reputation: 388982

Instead of NA, you can replace the outliers with the mean, the median, or whatever value you prefer.

library(dplyr)
library(purrr)

replaceOuts = function(df) {
   map_if(df, is.numeric, 
          ~ replace(.x, .x %in% boxplot.stats(.x)$out, mean(.x))) %>%
   bind_cols 
}

replaceOuts(df)

# RowID Value1 Value2
#  <dbl>  <dbl>  <dbl>
#1     1     6       1
#2     2     2     200
#3     3   100       3
#4     4     1       4
#5     5   108.      5
#6     6     2       6
#7     7     8     300
#8     8   108.    300
#9     9     2       9

Replace mean with median or any other function that you want.
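Note that `mean(.x)` in the function above is computed on the whole column, outliers included, which is why the replacement comes out at about 108. Here is a minimal sketch, assuming you would rather compute the replacement from the non-outlying values only and pass the summary function as an argument. The name `replace_outs` and its `fun` argument are made up for illustration; only base R is used, and the data frame is a reconstruction of the question's table with assumed column names.

```r
# Hypothetical helper: replace boxplot outliers in every numeric column
# with a summary (median by default) of the non-outlying values.
replace_outs <- function(df, fun = median) {
  num_cols <- vapply(df, is.numeric, logical(1))
  df[num_cols] <- lapply(df[num_cols], function(x) {
    is_out <- x %in% boxplot.stats(x)$out
    replace(x, is_out, fun(x[!is_out], na.rm = TRUE))
  })
  df
}

# Example data from the question (column names are assumed)
df <- data.frame(
  RowID  = 1:9,
  Value1 = c(6, 2, 100, 1, 250, 2, 8, 600, 2),
  Value2 = c(1, 200, 3, 4, 5, 6, 300, 300, 9)
)

res_median <- replace_outs(df)              # median of the remaining values
res_mean   <- replace_outs(df, fun = mean)  # mean of the remaining values
```

Any function that accepts `na.rm` can be passed as `fun`.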

PS - I think it is better to use mutate_if instead of map_if here since it avoids bind_cols at the end.

df %>% mutate_if(is.numeric, ~replace(., . %in% boxplot.stats(.)$out, mean(.)))
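In dplyr 1.0.0 and later, `mutate_if()` is superseded by `across()`. Here is a sketch of the same replacement in that style, assuming a reasonably recent dplyr and a reconstruction of the question's data (column names assumed):

```r
library(dplyr)

# Example data from the question (column names are assumed)
df <- data.frame(
  RowID  = 1:9,
  Value1 = c(6, 2, 100, 1, 250, 2, 8, 600, 2),
  Value2 = c(1, 200, 3, 4, 5, 6, 300, 300, 9)
)

# Same logic as the mutate_if() call, written with across()
res <- df %>%
  mutate(across(where(is.numeric),
                ~ replace(.x, .x %in% boxplot.stats(.x)$out, mean(.x))))
res
```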

Upvotes: 1

I think you need minVal and maxVal threshold values. Then replace the values outside the range (minVal, maxVal) with any value stored in myValue (mean, median, or whatever you need).

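# The limits can be any values; here they are the boxplot whisker ends
bs <- boxplot.stats(data$columnX)$stats
minVal <- bs[1]   # lower whisker
maxVal <- bs[5]   # upper whisker
myValue <- median(data$columnX, na.rm = TRUE)

# which() drops NA comparisons, so missing values do not break the assignment
out <- which(data$columnX < minVal | data$columnX > maxVal)
data[out, "columnX"] <- myValue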
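The question also mentions capping and several columns. Here is a minimal sketch, assuming the whisker ends from `boxplot.stats()` are acceptable caps and that every numeric column should be treated the same way. `cap_outs` is a hypothetical name; only base R is used, and the data frame is a reconstruction of the question's table with assumed column names.

```r
# Hypothetical helper: clamp a numeric vector to its boxplot whisker ends
cap_outs <- function(x) {
  lims <- boxplot.stats(x)$stats[c(1, 5)]  # lower and upper whisker ends
  pmin(pmax(x, lims[1]), lims[2])          # values outside are pulled to the nearest end
}

# Example data from the question (column names are assumed)
data <- data.frame(
  RowID  = 1:9,
  Value1 = c(6, 2, 100, 1, 250, 2, 8, 600, 2),
  Value2 = c(1, 200, 3, 4, 5, 6, 300, 300, 9)
)

# Apply the cap to every numeric column
num_cols <- vapply(data, is.numeric, logical(1))
data[num_cols] <- lapply(data[num_cols], cap_outs)
data
```

In real data you would probably drop identifier columns such as `RowID` from `num_cols` before applying the cap.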

Upvotes: 0
