MYjx

Reputation: 4407

Change variable values by groups using dplyr

I want to replace the missing values in multiple columns with the mean of each group. I tried to do this with dplyr, but it does not work for me.

For example:

iris2 <- iris
set.seed(1)
iris2[-5] <- lapply(iris2[-5], function(x) {
  x[sample(length(x), sample(10, 1))] <- NA
  x
})

impute_missing=function(x){
    x[is.na(x)]=mean(x,na.rm=TRUE)
    return(x)
}

iris2 %>% group_by(Species) %>% sapply(impute_missing)

However, this code did not impute the missing values by Species; it used the mean of all the non-missing values in each column instead. Another weird thing is that the function was also applied to Species, the grouping variable. Is there any way to impute the mean by Species and keep a complete data frame?

Upvotes: 4

Views: 4482

Answers (1)

akrun

Reputation: 887148

Try:

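 library(dplyr)
 # Replace every value in the measurement columns with its Species mean
 iris2New <- iris2 %>%
                   group_by(Species) %>%
                   mutate_each(funs(mean=mean(., na.rm=TRUE)), contains("."))

 # Copy the group means into iris2 only where the original value is NA
 iris2[,-5][is.na(iris2)[,-5]] <- iris2New[,-5][is.na(iris2)[,-5]]

 iris2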

Or, you could use ifelse on the initial dataset iris2:

  fun1 <- function(x) ifelse(is.na(x), mean(x, na.rm=TRUE), x)
  iris3 <-  iris2 %>% 
                  group_by(Species) %>% 
                  mutate_each(funs(fun1), contains(".") )

  identical(as.data.frame(iris3), iris2)
  #[1] TRUE

Or, instead of defining a separate function, you can put the ifelse call directly inside funs():

 iris4 <-  iris2 %>% 
                 group_by(Species) %>% 
                 mutate_each(funs(ifelse(is.na(.), mean(., na.rm=TRUE), .)), contains(".") )


 identical(iris3,iris4)
 #[1] TRUE
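
In current dplyr releases, mutate_each() and funs() are deprecated. A minimal sketch of the same grouped imputation using across() (available from dplyr 1.0.0) might look like the block below. The name iris5 is only illustrative and is not part of the answer above, and the data setup is copied from the question so the block runs on its own.

```r
library(dplyr)

# Recreate the question's data: iris with random NAs in the numeric columns.
iris2 <- iris
set.seed(1)
iris2[-5] <- lapply(iris2[-5], function(x) {
  x[sample(length(x), sample(10, 1))] <- NA
  x
})

# Replace each NA with the mean of its own Species group.
# across() leaves the grouping column (Species) untouched.
iris5 <- iris2 %>%
  group_by(Species) %>%
  mutate(across(contains("."),
                ~ ifelse(is.na(.x), mean(.x, na.rm = TRUE), .x))) %>%
  ungroup()

# Count the missing values that remain after imputation.
sum(is.na(iris5))
```

Because the mutate() call runs on a grouped data frame, mean() inside the lambda sees only the rows of the current Species, which is the step the sapply() attempt in the question was missing.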

Upvotes: 4
