Maximilian
Maximilian

Reputation: 4229

Remove outliers for all columns with R

I'm using these functions to remove outliers:

calcul.mad <- function(x) {
mad <- median(abs(x-median(x, na.rm=TRUE))) 
mad}

uper.interval <- function(x,y) {
up.inter <- median(x, na.rm=TRUE)+5*(y) 
up.inter}

lower.interval <- function(x,y) {
low.inter <- median(x, na.rm=TRUE)-5*(y)
low.inter}

functionData <- function(x,h,l) {
out <- ifelse(x > h, h, ifelse(x < l, l, x))
out}

Currently, I'm processing each column of a dataframe individually. Is there neat solution to apply these function across all columns of dataframe?

med_data <- median(column1, na.rm=TRUE)
cal_mad <- calcul.mad(column1)
up_data <- uper.interval(med_data, cal_mad)
low_data <- lower.interval(med_data, cal_mad)
column_without_outliers <- data.frame(functionData(column1, up_data, low_data))

Sample dataframe:

data_f <- data.frame(col1=rnorm(100,10,10), col2=rnorm(100,15,15), col3=rnorm(100,20,20))

Upvotes: 0

Views: 3864

Answers (3)

ok1more
ok1more

Reputation: 900

You can simplify this with something like following

data_f <- data.frame(col1=rnorm(100,10,10), col2=rnorm(100,15,15), col3=rnorm(100,20,20))

library(tidyverse)

rmOutlier <- function(x){
                        low <- median(x, na.rm=TRUE)-5*(mad(x)) 
                       high <- median(x, na.rm=TRUE)+5*(mad(x))   
                       out <- if_else(x > high, NA,ifelse(x < low, low, x)) 
                       out }

data_f2 <- map_df(data_f, rmOutlier)

There is mad function available in stats package for calcul.mad function that you have created. You can always use your own.

Upvotes: 0

treysp
treysp

Reputation: 713

You can use apply on a data frame.

The code below changes the trim from 5x to 2x, since it's very unlikely a normal distribution will have values that distant from the median.

data_f <- data.frame(col1=rnorm(100,10,10), col2=rnorm(100,15,15), col3=rnorm(100,20,20))

calcul.mad <- function(x) {
mad <- median(abs(x-median(x, na.rm=TRUE))) 
mad}

uper.interval <- function(x,y) {
up.inter <- median(x, na.rm=TRUE)+2*(y) 
up.inter}

lower.interval <- function(x,y) {
low.inter <- median(x, na.rm=TRUE)-2*(y)
low.inter}

functionData <- function(x,h,l) {
out <- ifelse(x > h, h, ifelse(x < l, l, x))
out}


outlier.fun <- function(column1) {
  med_data <- median(column1, na.rm=TRUE)
  cal_mad <- calcul.mad(column1)
  up_data <- uper.interval(med_data, cal_mad)
  low_data <- lower.interval(med_data, cal_mad)
  column_without_outliers <- functionData(column1, up_data, low_data)

  return(column_without_outliers)
  }

data_f_noout <- apply(data_f, 2, outlier.fun)

summary(data_f)
summary(data_f_noout)

Upvotes: 1

akrun
akrun

Reputation: 887901

You may try summarise_each from dplyr and apply the median and calcul.mad. Once we got that, we can calculate the uper.interval and lower.interval after reshaping the Sum_f1. Then, apply the function_Data once we get all the values.

library(dplyr)
Sum_f1 <- summarise_each(data_f,funs(median, calcul.mad))

n <-  2*ncol(data_f)

dl <- reshape(Sum_f1, idvar='id', direction='long', sep="_", 
                      varying=split(seq(n), as.numeric(gl(n,n/2,n))))


up_data <- mapply(uper.interval, dl[,2], dl[,3])
low_data <- mapply(lower.interval, dl[,2], dl[,3])
data_f1 <- data_f
data_f1[] <- Map(functionData, data_f, up_data, low_data)

Upvotes: 0

Related Questions