Dexter1611
Dexter1611

Reputation: 572

remove outliers after group by and then calculate mean for each group

I have a dataframe where I would like to first group on a particular column (ID) and then remove the outliers from a particular column (Number) based on group and then calculate the mean for each group.

library(dplyr)
id<-c("A","B","C","A","B","B")
id<-as.data.frame(id)
number <-c(5,10,2,6,1000,12)
number<-as.data.frame(number)
total<-cbind(id,number)




I tried below approach but it is not working

remove_outliers <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm, ...)
  val <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  y[x < (qnt[1] - val)] <- NA
  y[x > (qnt[2] + val)] <- NA
  y
}


df2 <- total %>% 
  group_by(id) %>% 
  mutate(mean_val = remove_outliers(number)) %>% 
  ungroup() %>% 
  filter(!is.na(mean_val))

I would appreciate if someone could help

Input and expected O/P

enter image description here

Upvotes: 0

Views: 763

Answers (1)

Ronak Shah
Ronak Shah

Reputation: 389325

There are not enough observations in your B group to treat 1000 as outlier.

See,

remove_outliers(c(5, 1000, 12))
#[1]    5 1000   12

However, if you add one more observation it treats 1000 as outlier.

remove_outliers(c(5, 1000, 12, 6))
#[1]  5 NA 12  6

So in general something like this should give you the expected output :

library(dplyr)

total %>% 
  group_by(id) %>% 
  mutate(mean_val = remove_outliers(number)) %>% 
  filter(!is.na(mean_val)) %>%
  mutate(mean_val = mean(mean_val)) %>%
  ungroup()

Upvotes: 5

Related Questions