R: summarise multiple columns (numeric, character) and remove NAs

Question

I have a data.frame with many columns (~50). Some of them are character, some are numeric and 3 of them I use for grouping.

I need to:

remove NAs from numeric columns
calculate the mean of each of the numeric columns
extract the first element of the character columns

Let's say we're using a modified iris data set, as below:

data(iris)
iris$year <- rep(c(2000,3000),each=25) ## for grouping
iris$color <- rep(c("red","green","blue"),each=50) ## character column
iris[1,] <- NA ## introducing NAs

I have ~50 columns in total, numeric and character mixed together. I've been trying something like:

library(dplyr)
giris <- group_by(iris, Species, year)
cls <- unlist(sapply(giris, class)) ## find out classes
action <- ifelse(cls == "numeric", "mean", "first")
action <- paste(action)
summarise_each(giris, action)

What I get is the means of all columns in each group, followed by columns holding the first value in each group. NAs are not handled either, so this is not what I'm after.

Help anyone?

talat · Accepted Answer

You could use an if/else inside funs() in summarise_each():

library(dplyr)

iris %>%
  group_by(Species, year) %>%
  summarise_each(funs(if (is.numeric(.)) mean(., na.rm = TRUE) else first(.)))

Since you also have some NAs in the grouping columns, you could add a filter step first:

iris %>% 
  filter(!is.na(Species) & !is.na(year)) %>% 
  group_by(Species, year) %>% 
  summarise_each(funs(if(is.numeric(.)) mean(., na.rm = TRUE) else first(.)))
#Source: local data frame [6 x 7]
#Groups: Species [?]
#
#     Species  year Sepal.Length Sepal.Width Petal.Length Petal.Width color
#      (fctr) (dbl)        (dbl)       (dbl)        (dbl)       (dbl) (chr)
#1     setosa  2000        5.025    3.479167       1.4625       0.250   red
#2     setosa  3000        4.984    3.376000       1.4640       0.244   red
#3 versicolor  2000        6.012    2.776000       4.3120       1.344 green
#4 versicolor  3000        5.860    2.764000       4.2080       1.308 green
#5  virginica  2000        6.576    2.928000       5.6400       2.044  blue
#6  virginica  3000        6.600    3.020000       5.4640       2.008  blue
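
In current dplyr releases, summarise_each() and funs() are deprecated. Here is a minimal sketch of the same idea written with across(), assuming dplyr 1.0.0 or later; the name res is only for illustration:

```r
library(dplyr)

# Rebuild the modified iris data from the question
data(iris)
iris$year  <- rep(c(2000, 3000), each = 25)               # for grouping
iris$color <- rep(c("red", "green", "blue"), each = 50)   # character column
iris[1, ]  <- NA                                          # introduce NAs

res <- iris %>%
  filter(!is.na(Species), !is.na(year)) %>%   # drop rows with NA in the grouping columns
  group_by(Species, year) %>%
  summarise(
    # numeric columns: mean without NAs; everything else: first value
    across(everything(),
           ~ if (is.numeric(.x)) mean(.x, na.rm = TRUE) else first(.x)),
    .groups = "drop"
  )

res
```

across(everything(), ...) leaves the grouping columns alone, so only the remaining columns are summarised, one row per Species/year combination.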

To avoid potential NAs in the color column (or any other non-numeric column), you could change first(.) to first(na.omit(.)).
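
To see what first(na.omit(.)) changes, here is a small sketch on made-up data (the toy data frame and its column names are hypothetical, not from the question), again assuming dplyr 1.0.0 or later for across():

```r
library(dplyr)

# Hypothetical toy data: the first colour in group "a" is missing
toy <- data.frame(
  g     = c("a", "a", "b"),
  value = c(1, NA, 3),
  color = c(NA, "red", "blue"),
  stringsAsFactors = FALSE
)

res <- toy %>%
  group_by(g) %>%
  summarise(
    # numeric columns: mean without NAs; other columns: first non-NA value
    across(everything(),
           ~ if (is.numeric(.x)) mean(.x, na.rm = TRUE) else first(na.omit(.x))),
    .groups = "drop"
  )

res
```

With plain first(.x), group "a" would get NA for color because its first row is missing; dropping the NAs first picks "red" instead.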

You could also try data.table:

library(data.table)
setDT(iris)
iris[!is.na(Species) & !is.na(year), lapply(.SD, function(x) {
     if(is.numeric(x)) mean(x, na.rm = TRUE) else x[!is.na(x)][1L]}), 
     by = list(Species, year)]
#      Species year Sepal.Length Sepal.Width Petal.Length Petal.Width color
#1:     setosa 2000        5.025    3.479167       1.4625       0.250   red
#2:     setosa 3000        4.984    3.376000       1.4640       0.244   red
#3: versicolor 2000        6.012    2.776000       4.3120       1.344 green
#4: versicolor 3000        5.860    2.764000       4.2080       1.308 green
#5:  virginica 2000        6.576    2.928000       5.6400       2.044  blue
#6:  virginica 3000        6.600    3.020000       5.4640       2.008  blue
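
Note that setDT() converts iris to a data.table by reference. If you would rather keep the original data.frame untouched, one option is to work on a copy made with as.data.table(). A sketch, where dt and res_dt are illustrative names:

```r
library(data.table)

# Rebuild the modified iris data from the question
data(iris)
iris$year  <- rep(c(2000, 3000), each = 25)
iris$color <- rep(c("red", "green", "blue"), each = 50)
iris[1, ]  <- NA

dt <- as.data.table(iris)   # a copy; iris itself stays a plain data.frame

res_dt <- dt[!is.na(Species) & !is.na(year),
             lapply(.SD, function(x) {
               # numeric columns: mean without NAs; others: first non-NA value
               if (is.numeric(x)) mean(x, na.rm = TRUE) else x[!is.na(x)][1L]
             }),
             by = .(Species, year)]

res_dt
```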
