psysky

Reputation: 3195

Remove outliers by group in R

In my dataset, I must remove outliers for each group separately. Here is my dataset:

vpg=structure(list(customer = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 
1L, 1L, 1L, 2L, 2L, 2L, 2L), code = c(2L, 2L, 3L, 3L, 4L, 4L, 
5L, 5L, 2L, 2L, 3L, 3L, 4L, 4L, 5L, 5L), year = c(2017L, 2017L, 
2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2018L, 2018L, 2018L, 
2018L, 2018L, 2018L, 2018L, 2018L), stuff = c(10L, 20L, 30L, 
40L, 50L, 60L, 70L, 80L, 10L, 20L, 30L, 40L, 50L, 60L, 70L, 80L
), action = c(0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 
0L, 1L, 0L, 1L)), .Names = c("customer", "code", "year", "stuff", 
"action"), class = "data.frame", row.names = c(NA, -16L))

I must remove outliers from the stuff variable, but separately for each customer + code + year group.

I found this handy function:

remove_outliers <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs=c(.25, .75), na.rm = na.rm, ...)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  y[x < (qnt[1] - H)] <- NA
  y[x > (qnt[2] + H)] <- NA
  y
}

new <- remove_outliers(vpg$stuff)
vpg=cbind(new,vpg)
View(vpg)

But it works on all groups at once. How can I use this function to remove outliers within each group and get a clean dataset for further work? Note that this dataset also has a variable action (it takes the values 0 and 1). It is not a grouping variable, but outliers must be removed only for rows where action is 0.

Upvotes: 3

Views: 10895

Answers (4)

Lilian Sanselme

Reputation: 21

Using library(tidyverse), you can define the function

add_new_column <- function(df) {
  new <- remove_outliers(df$stuff)
  return(cbind(new,df))
}

and then apply it group-wise on your whole dataframe:

vpg %>%
  group_by(customer, code, year) %>%
  nest() %>%
  mutate(data = map(data, add_new_column)) %>%  # call the helper defined above
  unnest(data)
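The pipeline above applies remove_outliers to every row of a group. A minimal sketch of a variant that only blanks rows where action is 0 is given below. The helper name add_new_column_zero and the toy data are hypothetical and were made up for illustration; remove_outliers is copied from the question so the block runs on its own.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Copied from the question
remove_outliers <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm, ...)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  y[x < (qnt[1] - H)] <- NA
  y[x > (qnt[2] + H)] <- NA
  y
}

# Hypothetical helper: flag outliers within one group, but keep the
# original value wherever action is not 0
add_new_column_zero <- function(df) {
  df$new <- ifelse(df$action == 0, remove_outliers(df$stuff), df$stuff)
  df
}

# Hypothetical toy data: two groups, each containing the extreme value 100;
# in the second group that row has action == 1
toy <- data.frame(
  customer = rep(1:2, each = 5),
  code     = 2L,
  year     = 2017L,
  stuff    = rep(c(10, 11, 12, 13, 100), times = 2),
  action   = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L)
)

res <- toy %>%
  group_by(customer, code, year) %>%
  nest() %>%
  mutate(data = map(data, add_new_column_zero)) %>%
  unnest(data) %>%
  ungroup()

res
```

In this sketch only the 100 in the first group should become NA, because the 100 in the second group sits on a row with action equal to 1.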

Upvotes: 2

akrun

Reputation: 887951

Here is an option using tidyverse

library(dplyr)
vpg %>%
  group_by_at(names(.)[1:3]) %>% 
  mutate(new = case_when(action == 0 ~ remove_outliers(stuff), TRUE ~ stuff))
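The sample data in the question has only two rows per group, so nothing is flagged there. As a quick check, here is a sketch on hypothetical toy data that does contain an extreme value. It uses group_by(customer, code, year) in place of group_by_at, which should be equivalent for these columns; remove_outliers is copied from the question.

```r
library(dplyr)

# Copied from the question
remove_outliers <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm, ...)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  y[x < (qnt[1] - H)] <- NA
  y[x > (qnt[2] + H)] <- NA
  y
}

# Hypothetical toy data: both groups contain the value 100, but in the
# second group that row has action == 1
toy <- data.frame(
  customer = rep(1:2, each = 5),
  code     = 2L,
  year     = 2017L,
  stuff    = rep(c(10, 11, 12, 13, 100), times = 2),
  action   = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L)
)

res <- toy %>%
  group_by(customer, code, year) %>%
  mutate(new = case_when(action == 0 ~ remove_outliers(stuff),
                         TRUE ~ stuff)) %>%
  ungroup()

res$new
```

Note that in this form the quartiles are computed from all rows of the group, including those with action equal to 1; the action test only decides which rows may be set to NA.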

Upvotes: 2

Terru_theTerror

Reputation: 5017

Try this solution:

Build a function that applies remove_outliers to a single customer+code+year group:

f<-function(x,vpg)
{
  select<-paste0(vpg$customer,vpg$code,vpg$year)==x
  out<-suppressWarnings(cbind(vpg[select,c("customer","code","year")][1,],remove_outliers(vpg[select,"stuff"])))
  return(out)
}

Iterate over all customer+code+year triplets:

library(dplyr)  # provides bind_rows()
uniq<-as.character(unique(paste0(vpg$customer,vpg$code,vpg$year)))
bind_rows(lapply(uniq,f,vpg=vpg))

Upvotes: 1

jogo

Reputation: 12569

Here is a solution with data.table:

library("data.table")
setDT(vpg)
vpg[, new:=stuff][action==0, new:=remove_outliers(stuff), by=.(customer, code, year)]
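Because action==0 is applied before the grouping, the quartiles in the call above are computed from the action == 0 rows only. The following base R sketch (no data.table needed) illustrates how that can differ from computing the quartiles on every row of the group. The toy data and the column names new_zero_only and new_all_rows are hypothetical; remove_outliers is copied from the question.

```r
# Copied from the question
remove_outliers <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm, ...)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  y[x < (qnt[1] - H)] <- NA
  y[x > (qnt[2] + H)] <- NA
  y
}

# Hypothetical toy data: a single group in which the action == 1 rows
# carry much larger values than the action == 0 rows
toy <- data.frame(
  customer = 1L, code = 2L, year = 2017L,
  stuff  = c(10, 11, 12, 13, 30, 100, 110, 120),
  action = c(0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L)
)

zero <- toy$action == 0

# Variant A: quartiles from the action == 0 rows only
toy$new_zero_only <- toy$stuff
toy$new_zero_only[zero] <- ave(toy$stuff[zero],
                               toy$customer[zero], toy$code[zero], toy$year[zero],
                               FUN = remove_outliers)

# Variant B: quartiles from all rows of the group, applied only where action == 0
flagged <- ave(toy$stuff, toy$customer, toy$code, toy$year,
               FUN = remove_outliers)
toy$new_all_rows <- ifelse(zero, flagged, toy$stuff)

toy
```

In this toy group, the value 30 is flagged under variant A but kept under variant B, since the large action == 1 values widen the interquartile range. Which behaviour is wanted depends on how the rule in the question is read.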

Upvotes: 2
