rpl
rpl

Reputation: 451

R: summarise multiple column (numeric, character) and remove NAs

I have a data.frame with many columns (~50). Some of them are character, some are numeric and 3 of them I use for grouping.

I need to:

Let's say, we're using modified iris data as below:

data(iris)
iris$year <- rep(c(2000,3000),each=25) ## for grouping
iris$color <- rep(c("red","green","blue"),each=50) ## character column
iris[1,] <- NA ## introducing NAs

I have ~50 columns in total, numeric and character mixed together. I've been trying something like:

giris <- group_by(iris, Species, year)
cls <- unlist(sapply(giris, class)) ## find out classes
action <- ifelse(cls == "numeric", "mean", "first")
action <- paste(action)
summarise_each(giris, action)

What I get is means for all columns in a group followed by columns with the first values in respective group. And NAs are not handled... Which is not exactly what I seek...

Help anyone?

Upvotes: 13

Views: 8390

Answers (2)

talat
talat

Reputation: 70296

You could try this with an if/else in the funs of summarise_each:

iris %>% 
  group_by(Species, year) %>% 
  summarise_each(funs(if(is.numeric(.)) mean(., na.rm = TRUE) else first(.)))

Since you have some NA's also in grouping columns, you could add a filter statement:

iris %>% 
  filter(!is.na(Species) & !is.na(year)) %>% 
  group_by(Species, year) %>% 
  summarise_each(funs(if(is.numeric(.)) mean(., na.rm = TRUE) else first(.)))
#Source: local data frame [6 x 7]
#Groups: Species [?]
#
#     Species  year Sepal.Length Sepal.Width Petal.Length Petal.Width color
#      (fctr) (dbl)        (dbl)       (dbl)        (dbl)       (dbl) (chr)
#1     setosa  2000        5.025    3.479167       1.4625       0.250   red
#2     setosa  3000        4.984    3.376000       1.4640       0.244   red
#3 versicolor  2000        6.012    2.776000       4.3120       1.344 green
#4 versicolor  3000        5.860    2.764000       4.2080       1.308 green
#5  virginica  2000        6.576    2.928000       5.6400       2.044  blue
#6  virginica  3000        6.600    3.020000       5.4640       2.008  blue

To avoid potential NA's in the color column (or any non-numeric columns), you could modify it to first(na.omit(.)).


You could also try data.table:

library(data.table)
setDT(iris)
iris[!is.na(Species) & !is.na(year), lapply(.SD, function(x) {
     if(is.numeric(x)) mean(x, na.rm = TRUE) else x[!is.na(x)][1L]}), 
     by = list(Species, year)]
#      Species year Sepal.Length Sepal.Width Petal.Length Petal.Width color
#1:     setosa 2000        5.025    3.479167       1.4625       0.250   red
#2:     setosa 3000        4.984    3.376000       1.4640       0.244   red
#3: versicolor 2000        6.012    2.776000       4.3120       1.344 green
#4: versicolor 3000        5.860    2.764000       4.2080       1.308 green
#5:  virginica 2000        6.576    2.928000       5.6400       2.044  blue
#6:  virginica 3000        6.600    3.020000       5.4640       2.008  blue

Upvotes: 18

Phann
Phann

Reputation: 1327

I give it a try:

1. For the first point you mention, I would do something like the following (which isn't necessary for the second point):

na.omit(iris[ , which(sapply(iris, class) == "numeric")])

To separate the columns bei either numeric or character, I use the following:

iris[ , which(sapply(iris, class) == "numeric")]
iris[ , which(sapply(iris, class) == "character")]

2. The second task I combine the above line with colMeans:

colMeans(iris[ , which(sapply(iris, class) == "numeric")], na.rm = TRUE)

3. To extract the first element of the character columns, you could simply do:

iris[1, which(sapply(iris, class) == "character")]

In the mentioned case for the iris data, the first row is completly NA, even the character columns, so I would iterate to find the first-non-NA-row

k <- 1
while(any(is.na(FirstCharacterElement <- iris[k, which(sapply(iris, class) == "character")]))){
  k <- k + 1
}

Be careful about the class factor (which break up the code in the case of the iris data where column Species is of class factor and you maybe expect it to be a character column. You can check this with sapply(iris, class) and change it with e.g.

iris$Species <- as.character(iris$Species) #or with similar column names

When you read in the data you can mention the parameter stringsAsFactors = FALSE of functions read.table, read.csv or similar.

Upvotes: 0

Related Questions