Reputation: 572
I have a dataframe where I would like to first group on a particular column (ID) and then remove the outliers from a particular column (Number) based on group and then calculate the mean for each group.
library(dplyr)
id<-c("A","B","C","A","B","B")
id<-as.data.frame(id)
number <-c(5,10,2,6,1000,12)
number<-as.data.frame(number)
total<-cbind(id,number)
I tried below approach but it is not working
remove_outliers <- function(x, na.rm = TRUE, ...) {
qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm, ...)
val <- 1.5 * IQR(x, na.rm = na.rm)
y <- x
y[x < (qnt[1] - val)] <- NA
y[x > (qnt[2] + val)] <- NA
y
}
df2 <- total %>%
group_by(id) %>%
mutate(mean_val = remove_outliers(number)) %>%
ungroup() %>%
filter(!is.na(mean_val))
I would appreciate if someone could help
Input and expected O/P
Upvotes: 0
Views: 763
Reputation: 389325
There are not enough observations in your B
group to treat 1000 as outlier.
See,
remove_outliers(c(5, 1000, 12))
#[1] 5 1000 12
However, if you add one more observation it treats 1000 as outlier.
remove_outliers(c(5, 1000, 12, 6))
#[1] 5 NA 12 6
So in general something like this should give you the expected output :
library(dplyr)
total %>%
group_by(id) %>%
mutate(mean_val = remove_outliers(number)) %>%
filter(!is.na(mean_val)) %>%
mutate(mean_val = mean(mean_val)) %>%
ungroup()
Upvotes: 5