Reputation: 345
I am trying to find and replace outliers in multiple numeric columns. This is not best practice in my humble opinion, but it is something I'm attempting to figure out for specific use cases. A great example of creating an additional column that labels a row as an outlier can be found here, but it is based on a single column.
My data looks as follows (for simplicity, I excluded columns with factors):
Row ID Value1 Value2
1 6 1
2 2 200
3 100 3
4 1 4
5 250 5
6 2 6
7 8 300
8 600 300
9 2 9
I used a function to replace outliers with NA in all numeric columns:
library(dplyr)  # %>% and bind_cols
library(purrr)  # map_if

replaceOuts = function(df) {
  map_if(df, is.numeric,
         ~ replace(.x, .x %in% boxplot.stats(.x)$out, NA)) %>%
    bind_cols
}
test = replaceOuts(df)
My question is how can I replace the outliers with another value (e.g., mean, median, capped value, etc.)? Any help would be appreciated!
Upvotes: 0
Views: 1136
Reputation: 388982
Instead of NA, you could replace the value with the mean or median, whichever you prefer.
library(dplyr)
library(purrr)
replaceOuts = function(df) {
  map_if(df, is.numeric,
         ~ replace(.x, .x %in% boxplot.stats(.x)$out, mean(.x))) %>%
    bind_cols
}
replaceOuts(df)
# RowID Value1 Value2
# <dbl> <dbl> <dbl>
#1 1 6 1
#2 2 2 200
#3 3 100 3
#4 4 1 4
#5 5 108. 5
#6 6 2 6
#7 7 8 300
#8 8 108. 300
#9 9 2 9
Replace mean with median or any other function that you want.
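Note that mean(.x) in the code above is computed from every value in the column, outliers included, which is why the replacement for Value1 comes out as roughly 108. If that is not what you want, here is a minimal base R sketch that takes the replacement function as an argument and evaluates it on the non-outlying values only. The helper name replace_outliers and its fun argument are made up for illustration, and fun is assumed to accept na.rm.

```r
# Example data from the question
df <- data.frame(
  RowID  = 1:9,
  Value1 = c(6, 2, 100, 1, 250, 2, 8, 600, 2),
  Value2 = c(1, 200, 3, 4, 5, 6, 300, 300, 9)
)

# Hypothetical helper: replace boxplot outliers in a numeric vector with
# fun() evaluated on the remaining (non-outlying) values.
replace_outliers <- function(x, fun = median) {
  is_out <- x %in% boxplot.stats(x)$out
  replace(x, is_out, fun(x[!is_out], na.rm = TRUE))
}

# Apply the helper to every numeric column, leaving other columns untouched
num_cols <- vapply(df, is.numeric, logical(1))
df_clean <- df
df_clean[num_cols] <- lapply(df[num_cols], replace_outliers, fun = mean)
df_clean
```

Pass fun = median (the default here) or any other summary function to change the replacement value.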
PS - I think it is better to use mutate_if instead of map_if here, since it avoids the bind_cols at the end.
df %>% mutate_if(is.numeric, ~replace(., . %in% boxplot.stats(.)$out, mean(.)))
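As a side note, mutate_if has been superseded by across() since dplyr 1.0.0. Assuming a recent dplyr version, the same replacement could be sketched like this (the data frame is the example from the question, rebuilt here so the snippet runs on its own):

```r
library(dplyr)

# Example data from the question
df <- data.frame(
  RowID  = 1:9,
  Value1 = c(6, 2, 100, 1, 250, 2, 8, 600, 2),
  Value2 = c(1, 200, 3, 4, 5, 6, 300, 300, 9)
)

# Same logic as the mutate_if call: outliers in each numeric column are
# replaced with that column's mean (computed over all values).
result <- df %>%
  mutate(across(where(is.numeric),
                ~ replace(.x, .x %in% boxplot.stats(.x)$out, mean(.x))))
result
```

The behaviour should match the mutate_if version; only the column-selection syntax differs.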
Upvotes: 1
Reputation: 306
I think you need minVal and maxVal threshold values. Then replace the values outside the range (minVal, maxVal) with whatever you store in myValue (mean, median, or whatever you need).
# The limits could be any values, e.g. the boxplot whisker ends:
minVal <- boxplot.stats(data$columnX)$stats[1]
maxVal <- boxplot.stats(data$columnX)$stats[5]
myValue <- median(data$columnX)
data[data$columnX < minVal | data$columnX > maxVal, "columnX"] <- myValue
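The same limits can also be used for the "capped value" option mentioned in the question. Below is a minimal base R sketch, assuming that capping means pulling outliers back to the boxplot whisker ends. The helper name cap_outliers is made up for illustration, and the data frame is the example from the question.

```r
# Example data from the question
df <- data.frame(
  RowID  = 1:9,
  Value1 = c(6, 2, 100, 1, 250, 2, 8, 600, 2),
  Value2 = c(1, 200, 3, 4, 5, 6, 300, 300, 9)
)

# Hypothetical helper: clamp values to the whisker ends reported by
# boxplot.stats() ($stats[1] = lower whisker end, $stats[5] = upper whisker end).
cap_outliers <- function(x) {
  limits <- boxplot.stats(x)$stats[c(1, 5)]
  pmin(pmax(x, limits[1]), limits[2])
}

# Apply the cap to every numeric column
num_cols <- vapply(df, is.numeric, logical(1))
df_capped <- df
df_capped[num_cols] <- lapply(df[num_cols], cap_outliers)
df_capped
```

With this data, the two Value1 outliers (250 and 600) are pulled down to 100, the largest value inside the upper fence, and Value2 is left unchanged because it has no boxplot outliers.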
Upvotes: 0