PaulBeales

Reputation: 505

Impute Column Outliers with Column median within a Dataframe

Is there a known function for doing this? I'd like to apply it to certain columns of my numeric data frame so that outliers are replaced with the column median.

Upvotes: 1

Views: 2281

Answers (1)

LyzandeR

Reputation: 37879

Assuming that you are using the interquartile range (IQR) to identify outliers, one way would be the following.

Example data:

#the first 3 rows are outliers here in both columns
set.seed(100)
mydf <- data.frame(a = c(1000, 1000, 1000, runif(10)), b = c(1000, 1000, 1000, runif(10)))

The following function replaces the outliers of a column with the column median. A value counts as an outlier if it is less than the first quartile (25%) minus 1.5 times the IQR, or more than the third quartile (75%) plus 1.5 times the IQR. Note that the median is computed over the whole column, outliers included:

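outlier <- function(x) {
  # First and third quartiles, and the 1.5 * IQR distance beyond them
  q <- quantile(x, c(0.25, 0.75))
  fence <- 1.5 * IQR(x)
  # Replace every value outside the fences with the column median
  x[x < q[1] - fence | x > q[2] + fence] <- median(x)
  x
}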
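One caveat: quantile() and IQR() stop with an error when the column contains NA, unless na.rm = TRUE is set. Below is a minimal sketch of a variant that tolerates missing values, assuming NAs should simply be left as they are. The name outlier_na is only an illustrative choice and not part of the function above:

```r
# Hypothetical NA-tolerant variant; same 1.5 * IQR rule as above
outlier_na <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * (q[2] - q[1])  # q[2] - q[1] is the IQR
  # NA values are never flagged, so they stay NA
  is_out <- !is.na(x) & (x < q[1] - fence | x > q[2] + fence)
  x[is_out] <- median(x, na.rm = TRUE)
  x
}

# Small made-up vector: 1000 is the only outlier, NA is kept
outlier_na(c(1000, NA, 1, 2, 3, 4, 5, 6, 7, 8))
```

As in the original function, the median here still includes the outlying values; whether that is desirable depends on your data.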

Output (using lapply to apply to each column):

> mydf[] <- lapply(mydf, outlier)
> mydf
            a         b
1  0.48377074 0.6690217
2  0.48377074 0.6690217
3  0.48377074 0.6690217
4  0.30776611 0.6249965
5  0.25767250 0.8821655
6  0.55232243 0.2803538
7  0.05638315 0.3984879
8  0.46854928 0.7625511
9  0.48377074 0.6690217
10 0.81240262 0.2046122
11 0.37032054 0.3575249
12 0.54655860 0.3594751
13 0.17026205 0.6902905

As you can see, the outliers (the values of 1000 in the first three rows of both columns of the original data.frame) have been replaced with the column median.
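Since the question asks about only certain columns, the same lapply call can be restricted to a subset of column names. A small sketch, where cols is a hypothetical selection (here just column a) and the function and example data are repeated so the block runs on its own:

```r
# Same rule as the outlier() function from the answer
outlier <- function(x) {
  x[x < quantile(x, 0.25) - 1.5 * IQR(x) | x > quantile(x, 0.75) + 1.5 * IQR(x)] <- median(x)
  x
}

set.seed(100)
mydf <- data.frame(a = c(1000, 1000, 1000, runif(10)),
                   b = c(1000, 1000, 1000, runif(10)))

# Hypothetical selection: impute only column "a", leave "b" untouched
cols <- c("a")
mydf[cols] <- lapply(mydf[cols], outlier)
```

After this, column a no longer contains the 1000s, while column b still does.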

Upvotes: 1
