Reputation: 4229
I'm using these functions to remove outliers:
calcul.mad <- function(x) {
mad <- median(abs(x-median(x, na.rm=TRUE)))
mad}
uper.interval <- function(x,y) {
up.inter <- median(x, na.rm=TRUE)+5*(y)
up.inter}
lower.interval <- function(x,y) {
low.inter <- median(x, na.rm=TRUE)-5*(y)
low.inter}
functionData <- function(x,h,l) {
out <- ifelse(x > h, h, ifelse(x < l, l, x))
out}
Currently, I'm processing each column of a dataframe individually. Is there neat solution to apply these function across all columns of dataframe?
med_data <- median(column1, na.rm=TRUE)
cal_mad <- calcul.mad(column1)
up_data <- uper.interval(med_data, cal_mad)
low_data <- lower.interval(med_data, cal_mad)
column_without_outliers <- data.frame(functionData(column1, up_data, low_data))
Sample dataframe:
data_f <- data.frame(col1=rnorm(100,10,10), col2=rnorm(100,15,15), col3=rnorm(100,20,20))
Upvotes: 0
Views: 3864
Reputation: 900
You can simplify this with something like following
data_f <- data.frame(col1=rnorm(100,10,10), col2=rnorm(100,15,15), col3=rnorm(100,20,20))
library(tidyverse)
rmOutlier <- function(x){
low <- median(x, na.rm=TRUE)-5*(mad(x))
high <- median(x, na.rm=TRUE)+5*(mad(x))
out <- if_else(x > high, NA,ifelse(x < low, low, x))
out }
data_f2 <- map_df(data_f, rmOutlier)
There is mad
function available in stats package for calcul.mad
function that you have created. You can always use your own.
Upvotes: 0
Reputation: 713
You can use apply
on a data frame.
The code below changes the trim from 5x to 2x, since it's very unlikely a normal distribution will have values that distant from the median.
data_f <- data.frame(col1=rnorm(100,10,10), col2=rnorm(100,15,15), col3=rnorm(100,20,20))
calcul.mad <- function(x) {
mad <- median(abs(x-median(x, na.rm=TRUE)))
mad}
uper.interval <- function(x,y) {
up.inter <- median(x, na.rm=TRUE)+2*(y)
up.inter}
lower.interval <- function(x,y) {
low.inter <- median(x, na.rm=TRUE)-2*(y)
low.inter}
functionData <- function(x,h,l) {
out <- ifelse(x > h, h, ifelse(x < l, l, x))
out}
outlier.fun <- function(column1) {
med_data <- median(column1, na.rm=TRUE)
cal_mad <- calcul.mad(column1)
up_data <- uper.interval(med_data, cal_mad)
low_data <- lower.interval(med_data, cal_mad)
column_without_outliers <- functionData(column1, up_data, low_data)
return(column_without_outliers)
}
data_f_noout <- apply(data_f, 2, outlier.fun)
summary(data_f)
summary(data_f_noout)
Upvotes: 1
Reputation: 887901
You may try summarise_each
from dplyr
and apply the median
and calcul.mad
. Once we got that, we can calculate the uper.interval
and lower.interval
after reshaping the Sum_f1
. Then, apply the function_Data
once we get all the values.
library(dplyr)
Sum_f1 <- summarise_each(data_f,funs(median, calcul.mad))
n <- 2*ncol(data_f)
dl <- reshape(Sum_f1, idvar='id', direction='long', sep="_",
varying=split(seq(n), as.numeric(gl(n,n/2,n))))
up_data <- mapply(uper.interval, dl[,2], dl[,3])
low_data <- mapply(lower.interval, dl[,2], dl[,3])
data_f1 <- data_f
data_f1[] <- Map(functionData, data_f, up_data, low_data)
Upvotes: 0