HowdyDude

Reputation: 483

Replacing Defined Outliers Using Apply/Tapply in R

Good Afternoon R wizards,

I searched through a few posts on replacing outliers in a data set. The two that came closest to answering my question were "Changing outliers for NA in all columns in a dataset in R" and "Replace outliers by quantiles in R".

The code in the second reference works great if you want to update a column or two, but I have 40+ columns and would like to use an apply function to hit them all at once.

I want to set a threshold "max" of quantile(probs = .75) for each column and replace any x > "max" with "max".

set.seed(1)
x = matrix(rnorm(20), ncol = 2)
x[2, 1] = 100
x[4, 2] = 200
colnames(x) <- c("a","b")
#apply(x,2,quantile,probs = .75)

Winsor75 <- function(x) {
  Max <- quantile(x, probs = .75)
  return(Max)
}
y <- as.data.frame(x)

y$a[y$a > Winsor75(x)] <- Winsor75(x)

The last line of code replaces the defined outliers in column a (in my case, values above the 75th percentile), but it uses the 75th percentile of the entire matrix "x". Instead, I would like (a) the quantile to be computed separately for each column and (b) to be able to use the function with apply/tapply etc. so I can perform the operation on all columns efficiently.
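
To make the difference concrete, here is a minimal sketch using the example data from above. It only compares the single pooled threshold with the per-column thresholds; the names pooled and per_column are illustrative and not part of the original code.

```r
# Example data from the question
set.seed(1)
x <- matrix(rnorm(20), ncol = 2)
x[2, 1] <- 100
x[4, 2] <- 200
colnames(x) <- c("a", "b")

# One threshold computed from all 20 values pooled together
# (this is what Winsor75(x) returns when given the whole matrix)
pooled <- quantile(x, probs = 0.75)

# One threshold per column, which is what the question is after
per_column <- apply(x, 2, quantile, probs = 0.75)

print(pooled)
print(per_column)
```

The pooled value is a single number, while per_column holds one threshold for each of a and b.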

Any suggestions?

Thanks!

Upvotes: 2

Views: 249

Answers (2)

MKR

Reputation: 20095

One option is to use mutate_all with a custom function to apply the rule to all columns.

Approach:

I have created a replaceOutlier function (based on the OP's function) which calculates Max and then replaces any item greater than Max before returning the vector. replaceOutlier is applied over all columns using dplyr::mutate_all.

library(tidyverse)

replaceOutlier <- function(x) {
  Max <- quantile(x, probs = .75)
  x[x>Max] <- Max
  return(x)
}

# funs() is deprecated in recent dplyr versions; the function can be passed directly
x %>% as_tibble() %>% mutate_all(replaceOutlier)

#Results
# # A tibble: 10 x 2
#     a       b
#   <dbl>   <dbl>
# 1 -0.626  1.08  
# 2  0.698  0.390 
# 3 -0.836 -0.621 
# 4  0.698  1.08  
# 5  0.330  1.08  
# 6 -0.820 -0.0449
# 7  0.487 -0.0162
# 8  0.698  0.944 
# 9  0.576  0.821 
# 10 -0.305  0.594 
# 

Data

set.seed(1)
x = matrix(rnorm(20), ncol = 2)
x[2, 1] = 100
x[4, 2] = 200
colnames(x) <- c("a","b")
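
Since the question asks specifically about apply, the same replaceOutlier function can also be used without dplyr. The following is a minimal base R sketch on the matrix x from the Data section; the name capped is illustrative. Note that quantile() stops with an error if a column contains NA unless na.rm = TRUE is added, which this sketch assumes is not needed.

```r
# Example data (same as the Data section)
set.seed(1)
x <- matrix(rnorm(20), ncol = 2)
x[2, 1] <- 100
x[4, 2] <- 200
colnames(x) <- c("a", "b")

# Cap every value above the column's own 75th percentile
replaceOutlier <- function(x) {
  Max <- quantile(x, probs = 0.75)
  x[x > Max] <- Max
  x
}

# MARGIN = 2 calls replaceOutlier once per column;
# the result is a matrix with the same dimensions and column names as x
capped <- apply(x, 2, replaceOutlier)
print(capped)
```

Wrapping the result in as.data.frame(capped) gives a data frame if that is the preferred output.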

Upvotes: 0

ngm

Reputation: 2589

as.data.frame(lapply(y, function(x) pmin(x, quantile(x, 0.75, na.rm = TRUE))))

As a function:

df_winsor <- function(df, p) {
  as.data.frame(lapply(df, 
                       function(x) pmin(x, quantile(x, probs = p, na.rm = TRUE))))
} 
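
A short usage sketch, assuming y is the data frame built from the question's example matrix (y_capped is an illustrative name). Because pmin() keeps NA by default, any missing values in a column stay missing in the result, while na.rm = TRUE lets quantile() ignore them when computing the threshold.

```r
# Example data from the question
set.seed(1)
x <- matrix(rnorm(20), ncol = 2)
x[2, 1] <- 100
x[4, 2] <- 200
colnames(x) <- c("a", "b")
y <- as.data.frame(x)

# Cap each column at its own p-th quantile
df_winsor <- function(df, p) {
  as.data.frame(lapply(df,
                       function(x) pmin(x, quantile(x, probs = p, na.rm = TRUE))))
}

y_capped <- df_winsor(y, 0.75)

# Each column's maximum should now equal that column's 75th percentile
print(sapply(y_capped, max))
print(sapply(y, quantile, probs = 0.75))
```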

Statistician's Disclaimer: I've solved the programming problem you asked about. This should not be taken as an endorsement of the idea of automatically checking for, or doing anything with, so-called "outliers".

Upvotes: 1
