Reputation: 483
Good Afternoon R wizards,
I searched through a few posts on replacing outliers in a data set. The two that came closest to answering my question were "Changing outliers for NA in all columns in a dataset in R" and "Replace outliers by quantiles in R".
The code in the second reference works great if you want to update a column or two, but I have 40+ and would like to use an apply function to hit all the columns at once.
I want to set a threshold "max" of quantile(probs = .75) for each column, and replace any x > "max" with "max".
set.seed(1)
x = matrix(rnorm(20), ncol = 2)
x[2, 1] = 100
x[4, 2] = 200
colnames(x) <- c("a","b")
#apply(x,2,quantile,probs = .75)
Winsor75 <- function(x) {
Max <- quantile(x, probs = .75)
return(Max)
}
y <- as.data.frame(x)
y$a[y$a > Winsor75(x)] <- Winsor75(x)
The last line of code effectively replaces any defined outliers (in my case, values above the 75th percentile), but it uses the 75th percentile of the entire matrix "x", whereas I would like (a) the quantile to be computed separately for each column and (b) to be able to use the function in apply/tapply etc. so I can perform the operation on all columns efficiently.
Any suggestions?
Thanks!
Upvotes: 2
Views: 249
Reputation: 20095
One option is to use `mutate_all` with a custom function and apply the rule to all columns.
Approach:
I have created a `replaceOutlier` function (based on the OP's function) which calculates `Max` and then replaces any item greater than `Max` with `Max` before returning the vector. `replaceOutlier` is applied over all columns using `dplyr::mutate_all`.
library(tidyverse)
replaceOutlier <- function(x) {
Max <- quantile(x, probs = .75)
x[x>Max] <- Max
return(x)
}
x %>% as_tibble() %>% mutate_all(replaceOutlier)
#Results
# # A tibble: 10 x 2
# a b
# <dbl> <dbl>
# 1 -0.626 1.08
# 2 0.698 0.390
# 3 -0.836 -0.621
# 4 0.698 1.08
# 5 0.330 1.08
# 6 -0.820 -0.0449
# 7 0.487 -0.0162
# 8 0.698 0.944
# 9 0.576 0.821
# 10 -0.305 0.594
#
Data
set.seed(1)
x = matrix(rnorm(20), ncol = 2)
x[2, 1] = 100
x[4, 2] = 200
colnames(x) <- c("a","b")
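Since the question asked specifically about apply, here is a minimal base R sketch of the same idea, assuming the data stays a numeric matrix as in the example. The name `capped` is only illustrative; `replaceOutlier` is the function defined above, repeated so the snippet runs on its own.

```r
# Rebuild the example matrix from the question
set.seed(1)
x <- matrix(rnorm(20), ncol = 2)
x[2, 1] <- 100
x[4, 2] <- 200
colnames(x) <- c("a", "b")

# Same capping function as above: values above the 75% quantile are set to it
replaceOutlier <- function(x) {
  Max <- quantile(x, probs = .75)
  x[x > Max] <- Max
  x
}

# MARGIN = 2 runs the function on each column separately,
# so every column gets its own threshold
capped <- apply(x, 2, replaceOutlier)
capped
```

Because each call returns a vector of the same length as the column, apply reassembles the results into a matrix with the original column names.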
Upvotes: 0
Reputation: 2589
as.data.frame(lapply(y, function(x) pmin(x, quantile(x, 0.75, na.rm = TRUE))))
As a function:
df_winsor <- function(df, p) {
as.data.frame(lapply(df,
function(x) pmin(x, quantile(x, probs = p, na.rm = TRUE))))
}
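A quick usage sketch, assuming the example data from the question is held in a data frame `y`. The name `capped` is only illustrative, and the function is repeated so the snippet runs on its own.

```r
# Example data from the question, built directly as a data frame
set.seed(1)
y <- data.frame(a = rnorm(10), b = rnorm(10))
y$a[2] <- 100
y$b[4] <- 200

# Function from above: cap every column at its own p-th quantile
df_winsor <- function(df, p) {
  as.data.frame(lapply(df,
    function(x) pmin(x, quantile(x, probs = p, na.rm = TRUE))))
}

capped <- df_winsor(y, 0.75)

# Each column's maximum should now equal that column's 75% quantile
sapply(capped, max)
sapply(y, quantile, probs = 0.75)
```

Note that na.rm = TRUE only affects how the quantile is computed; pmin leaves any NA values in the data as NA.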
Statistician's Disclaimer: I've solved the programming problem you asked. This should not be taken as an endorsement of the idea of automatically checking for, or doing anything with, so-called "outliers".
Upvotes: 1