Reputation: 505
Is there a known function for doing this? I'd like to apply it to certain columns of my numeric data frame so that outliers are replaced with the column median.
Upvotes: 1
Views: 2281
Reputation: 37879
Take a look at this example. Assuming you are using the interquartile range (IQR) to identify outliers, one way would be the following.
Example data:
# The first three rows are outliers in both columns
set.seed(100)
mydf <- data.frame(a = c(1000, 1000, 1000, runif(10)),
                   b = c(1000, 1000, 1000, runif(10)))
The following function replaces the outliers in a column with that column's median. A value counts as an outlier if it is below the first quartile (25th percentile) minus 1.5 times the IQR, or above the third quartile (75th percentile) plus 1.5 times the IQR. Note that the median is computed on the full column, outliers included:
outlier <- function(x) {
  # Replace values outside the 1.5 * IQR fences with the median of x
  x[x < quantile(x, 0.25) - 1.5 * IQR(x) | x > quantile(x, 0.75) + 1.5 * IQR(x)] <- median(x)
  x
}
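To make the rule concrete, here is a small sketch that computes the two fences by hand on a made-up vector (the values in `x` are illustrative, not the example data above). It relies on `quantile()` with its default settings, which is also what `IQR()` uses.

```r
# Illustrative vector: three obvious outliers plus ten ordinary values
x <- c(1000, 1000, 1000, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0)

# First and third quartiles (default quantile type, same as IQR())
q <- quantile(x, c(0.25, 0.75), names = FALSE)

# Fences at 1.5 * IQR below Q1 and above Q3
fence <- 1.5 * (q[2] - q[1])
lower <- q[1] - fence
upper <- q[2] + fence

# Values outside the fences; here these are the three 1000s
flagged <- x[x < lower | x > upper]
flagged
```

Anything outside `lower` and `upper` is what the function above overwrites with the median.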
Output (using lapply to apply the function to each column):
> mydf[] <- lapply(mydf, outlier)
> mydf
            a         b
1  0.48377074 0.6690217
2  0.48377074 0.6690217
3  0.48377074 0.6690217
4  0.30776611 0.6249965
5  0.25767250 0.8821655
6  0.55232243 0.2803538
7  0.05638315 0.3984879
8  0.46854928 0.7625511
9  0.48377074 0.6690217
10 0.81240262 0.2046122
11 0.37032054 0.3575249
12 0.54655860 0.3594751
13 0.17026205 0.6902905
As you can see, the outliers (the values of 1000 in the first three rows of both columns of the original data.frame) have been replaced with the column median.
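Since the question asks about only certain columns, the same idea can be restricted to a subset by indexing the data frame with a vector of column names. The sketch below also tolerates missing values, which the version above does not, because `quantile()` stops with an error on NA input unless `na.rm = TRUE` is set. The names `outlier_na`, `df2` and `cols` are hypothetical and chosen only for this illustration.

```r
# Hypothetical NA-tolerant variant of the outlier() function above
outlier_na <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- 1.5 * (q[2] - q[1])
  # NAs are never flagged, so they stay NA in the result
  is_out <- !is.na(x) & (x < q[1] - fence | x > q[2] + fence)
  x[is_out] <- median(x, na.rm = TRUE)
  x
}

# Made-up data: outliers in a and b, an NA in a, and an id column to leave alone
set.seed(100)
df2 <- data.frame(a  = c(1000, NA, runif(10)),
                  b  = c(1000, 1000, runif(10)),
                  id = 1:12)

cols <- c("a", "b")  # only these columns are cleaned
df2[cols] <- lapply(df2[cols], outlier_na)
df2
```

Columns not listed in `cols` (here `id`) are left untouched.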
Upvotes: 1